Measure Theory and Probability 



2012 



P. Ouwehand 



Department of Mathematical Sciences 
Stellenbosch University 



Contents

1 Events and Probabilities
  1.1 Motivation
    1.1.1 Probabilistic Modelling
    1.1.2 Length, Area and Probability
  1.2 Events
    1.2.1 Structure of Events
    1.2.2 Events and $\sigma$-algebras
  1.3 Measures
  1.4 Properties of Measures
    1.4.1 Additivity properties
    1.4.2 Limit Operations on Sets
    1.4.3 Limits of Sets and Measures
  1.5 Extension of Measures
    1.5.1 Other Families of Sets
    1.5.2 The Extension Theorem
  1.6 Lebesgue Measure
    1.6.1 Lebesgue measure on $\mathbb{R}$
    1.6.2 Lebesgue Measure on $\mathbb{R}^d$
    1.6.3 Lebesgue Measure and Coin Tossing

2 Random Variables
  2.1 Definition of Measurable Function
  2.2 Combinations of Measurable Functions
  2.3 Approximation by Simple Functions
  2.4 Pushing Measures along Functions: The Laws of a Random Variable

3 Information and Independence
  3.1 Conditional Probability and Independence of Events
  3.2 Borel-Cantelli Lemmas
  3.3 Information in Random Variables
  3.4 Independence of $\sigma$-algebras and Random Variables

4 Integration and Expectation
  4.1 The Integral: Definition and Basic Properties
  4.2 Lebesgue's Dominated Convergence Theorem
  4.3 Measure Zero
  4.4 Riemann Integral vs. Lebesgue Integral
  4.5 Chain Rule, Change of Variables
  4.6 Definition of Expectation
  4.7 Inequalities

5 Products and Independence
  5.1 Product Spaces
    5.1.1 Introduction
    5.1.2 Products of Measure Spaces
  5.2 Independence

6 Spaces of Random Variables
  6.1 Topological Vector Spaces
    6.1.1 Normed Vector Spaces
    6.1.2 Inner Product Spaces
    6.1.3 Orthogonal Projection in Hilbert Spaces
  6.2 $L^p$ Spaces
  6.3 $L^p$-Spaces and Probability

7 Conditional Expectation
  7.1 Definition of Conditional Expectation
    7.1.1 Conditioning on an Event
    7.1.2 Conditioning on a $\sigma$-Algebra in a Discrete Probability Space
    7.1.3 Conditioning on a $\sigma$-Algebra in a General Probability Space
  7.2 Properties of Conditional Expectation
  7.3 Radon-Nikodym Theorem

8 Convergence in Probability Theory
  8.1 Convergence of Random Variables and Measures
    8.1.1 Modes of Convergence
    8.1.2 Uniform Integrability
  8.2 Weak Convergence and Convergence in Distribution
  8.3 Characteristic Functions
    8.3.1 Basic Properties
    8.3.2 Inversion
    8.3.3 Weak Convergence and Characteristic Functions
  8.4 The Central Limit Theorem

A Sets, Logic and Functions
  A.1 Logic and Formal Language
    A.1.1 Symbols denoting Objects, Operations and Relations
    A.1.2 Logical Connectives
    A.1.3 Quantifiers
  A.2 Sets, Functions and Relations
    A.2.1 Operations on sets
    A.2.2 Functions
    A.2.3 Functions Operating On Sets
    A.2.4 Relations
  A.3 Countable and Uncountable Sets



Chapter 1 

Events and Probabilities 



1.1 Motivation 

1.1.1 Probabilistic Modelling 

A model for an experiment involving randomness takes the form $(\Omega, \mathcal{F}, \mathbb{P})$. Intuitively, $\Omega$
is the set of all possible outcomes of the experiment, and is called the sample space. $\mathcal{F}$ is
the set of all events, i.e. "permissible" combinations of outcomes. (We shall soon see that
not all combinations of outcomes are permissible; indeed, it is impossible to create a consistent
mathematical theory of probability in which every set of outcomes is permissible.) $\mathbb{P}$ is a map
$\mathbb{P} : \mathcal{F} \to [0, 1]$ which assigns to each (permissible) event its probability.

Example 1.1.1 A die is rolled once. Each possible outcome is an integer between one and
six. Thus the sample space can be taken to be $\Omega = \{1, 2, \dots, 6\}$. We may be interested in
the following events:

(a) The outcome is the number 1; 

(b) The outcome is an even number; 

(c) The outcome is an odd number which is strictly greater than 1; 

Each of these events can be described by a subset of the sample space. Thus if A, B, C are 
the subsets corresponding to the events (a), (b), (c), then 

A = {1} 

B = {2, 4, 6} 

C = {3,5} 

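The probabilities of these events, by elementary reasoning, are $\mathbb{P}(A) = \frac16$, $\mathbb{P}(B) = \frac12$ and
$\mathbb{P}(C) = \frac13$, provided that the die is fair. Every subset of $\Omega$ is a "permissible" event, and thus
$\mathcal{F} = \mathcal{P}(\Omega)$, the power set of $\Omega$.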

□ 
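The die model of Example 1.1.1 is small enough to write out in full. The following is a minimal sketch, assuming a fair die; the helper names `power_set` and `prob` are illustrative and do not come from the notes.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))  # sample space of one die roll


def power_set(s):
    """Return all subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def prob(event):
    """Uniform probability |event| / |omega| (fair-die assumption)."""
    return Fraction(len(event & omega), len(omega))


events = power_set(omega)      # here every subset is a permissible event
A = frozenset({1})             # (a) the outcome is 1
B = frozenset({2, 4, 6})       # (b) the outcome is even
C = frozenset({3, 5})          # (c) the outcome is odd and greater than 1
```

Using exact fractions rather than floats keeps the probabilities comparable with the values stated in the example.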

Mathematically, an event is a set, i.e. events are just subsets of the sample space. The
outcome of any random experiment must be some element $\omega$ of the sample space $\Omega$. Now $\Omega$ is
itself a subset of $\Omega$ and thus corresponds to some event. We call it the certain event, since we
are certain that $\omega \in \Omega$. We must always have $\mathbb{P}(\Omega) = 1$. The empty set $\emptyset$ is also a subset of
$\Omega$ and thus corresponds to some event. We call it the impossible event, since it is impossible
that an outcome $\omega$ is in $\emptyset$. We will always have $\mathbb{P}(\emptyset) = 0$.

Note that the sample space corresponding to a random experiment need not be unique.
Consider, for example, the random experiment of rolling two dice. Then we can choose the
sample space to be the 36-element set $\Omega_1 = \{(i,j) : i, j \text{ integers between 1 and 6}\}$.
The probabilities for each outcome are then the same: $\mathbb{P}(\omega) = \frac{1}{36}$ for each $\omega \in \Omega_1$. This is a
so-called uniform distribution. On the other hand, we can choose the sample space to be the
11-element set $\Omega_2 = \{2, 3, 4, \dots, 12\}$ corresponding to the total of the two dice. In this case,
the probability distribution is non-uniform: $\mathbb{P}(\{7\}) = \frac16$ whereas $\mathbb{P}(\{2\}) = \frac{1}{36}$. Choosing the
sample space and the corresponding probability distribution for a particular situation is part
of the art of probabilistic modelling.
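The two sample spaces for the pair of dice can be compared directly. This is a small sketch, assuming fair dice; the variable names `omega1`, `p1` and `p2` are chosen here for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# 36-element sample space of ordered pairs (i, j), with the uniform distribution
omega1 = list(product(range(1, 7), repeat=2))
p1 = {w: Fraction(1, 36) for w in omega1}

# 11-element sample space of totals; each total inherits the mass of its pairs
totals = Counter(i + j for (i, j) in omega1)
p2 = {s: Fraction(count, 36) for s, count in totals.items()}
```

The distribution on the totals is obtained by adding up the probabilities of all pairs with that total, which is why it is no longer uniform.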

Example 1.1.2 A coin is flipped until the first head turns up. This may happen on the first
toss or the second, or ... or never. Thus the sample space is $\Omega = \{\omega_1, \omega_2, \dots, \omega_\infty\}$, where the
outcome $\omega_n$ denotes the event that the first head turns up on the $n$th toss, and $\omega_\infty$ denotes
the event of never flipping heads. It is clear from elementary probability that $\mathbb{P}(\{\omega_n\}) = \frac{1}{2^n}$
(provided that the coin is fair). We may now consider various composite events, such as:

(a) Let $A$ be the event that the first head appears on either the third or the fourth toss. Then
$A = \{\omega_3\} \cup \{\omega_4\} = \{\omega_3, \omega_4\}$. Clearly $\mathbb{P}(A) = \frac18 + \frac1{16} = \frac{3}{16}$.

(b) Let $B$ be the event that the first head appears after an even number of tosses. Thus
$B = \bigcup_{n \in \mathbb{N}} \{\omega_{2n}\}$ and $\mathbb{P}(B) = \sum_{n=1}^\infty 2^{-2n} = \frac13$. Did you think that the probability that the
first head appears after an even number of tosses is $\frac12$? If so, note that the probability
that the first head appears on the first toss is $\frac12$, and the probability that the first head
appears after an odd number of tosses is therefore greater than $\frac12$.

(c) Let $C$ be the event that both events $A$ and $B$ occur. Clearly $C = \{\omega_4\} = A \cap B$.

(d) Let $D$ be the event that a head does occur after a finite number of tosses. Thus $D$ is the
complement of the event that heads never occurs. Thus $D = \Omega - \{\omega_\infty\} = \{\omega_1, \omega_2, \dots\}$.
Hence $\mathbb{P}(D) = \sum_{n=1}^\infty \frac{1}{2^n} = 1$. This can also be seen from the fact that $\mathbb{P}(\{\omega_\infty\}) = 0$.

□ 
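The series in Example 1.1.2 can be checked with exact partial sums. The sketch below assumes a fair coin; the function names are illustrative, and the closed forms in the comments are the standard geometric-series identities rather than anything stated in the notes.

```python
from fractions import Fraction


def p(n):
    """P({omega_n}) = 1 / 2**n for a fair coin."""
    return Fraction(1, 2**n)


def partial_even(N):
    """Partial sum of P(B): first head on toss 2, 4, ..., 2N."""
    return sum(p(2 * n) for n in range(1, N + 1))


def partial_all(N):
    """Partial sum of P(D): first head on some toss 1, ..., N."""
    return sum(p(n) for n in range(1, N + 1))


p_A = p(3) + p(4)  # first head on the third or the fourth toss

# The gaps to the limits shrink geometrically:
#   1/3 - partial_even(N) = 1 / (3 * 4**N)
#   1   - partial_all(N)  = 1 / 2**N
```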

1.1.2 Length, Area and Probability 

"Length" is a non-negative real number associated with certain subsets of the real line R, i.e. 
it is a function $\mathcal{L}(\cdot)$ which assigns to a subset $E \subseteq \mathbb{R}$ its length $\mathcal{L}(E)$. Similarly, "area" is a
non-negative real number associated with certain subsets of the Euclidean plane $\mathbb{R}^2$, i.e. it is
a function $\mathcal{A}(\cdot)$ which assigns to a set $E \subseteq \mathbb{R}^2$ its area $\mathcal{A}(E)$. Intuitively, the length and area
functions should have certain properties:






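$1_{\mathcal{L}}$. If $E \subseteq \mathbb{R}$ is bounded, then $0 \le \mathcal{L}(E) < \infty$.

$2_{\mathcal{L}}$. $\mathcal{L}(\cdot)$ is additive: If $E, F$ are disjoint bounded subsets of $\mathbb{R}$, then $\mathcal{L}(E \cup F) = \mathcal{L}(E) + \mathcal{L}(F)$.

More generally, $\mathcal{L}(\cdot)$ is countably additive: if $E_1, E_2, \dots, E_n, \dots$ ($n \in \mathbb{N}$) is a sequence
of mutually disjoint bounded subsets of $\mathbb{R}$, then $\mathcal{L}(\bigcup_{n=1}^\infty E_n) = \sum_{n=1}^\infty \mathcal{L}(E_n)$.

$3_{\mathcal{L}}$. $\mathcal{L}(\cdot)$ is translation invariant: If a set $F$ is obtained by shifting a set $E$, then $\mathcal{L}(F) = \mathcal{L}(E)$.

$4_{\mathcal{L}}$. If $E$ is an interval, then $\mathcal{L}(E)$ = length of interval.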



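$1_{\mathcal{A}}$. If $E \subseteq \mathbb{R}^2$ is bounded, then $0 \le \mathcal{A}(E) < \infty$.

$2_{\mathcal{A}}$. $\mathcal{A}(\cdot)$ is additive: If $E, F$ are disjoint bounded subsets of $\mathbb{R}^2$, then $\mathcal{A}(E \cup F) = \mathcal{A}(E) + \mathcal{A}(F)$.

More generally, $\mathcal{A}(\cdot)$ is countably additive: if $E_1, E_2, \dots, E_n, \dots$ ($n \in \mathbb{N}$) is a sequence
of mutually disjoint bounded subsets of $\mathbb{R}^2$, then $\mathcal{A}(\bigcup_{n=1}^\infty E_n) = \sum_{n=1}^\infty \mathcal{A}(E_n)$.

$3_{\mathcal{A}}$. $\mathcal{A}(\cdot)$ is rotation- and translation invariant: If a set $F$ is obtained by rotating and/or
shifting a set $E$, then $\mathcal{A}(F) = \mathcal{A}(E)$.

$4_{\mathcal{A}}$. If $E$ is a rectangle, then $\mathcal{A}(E)$ = length $\times$ breadth.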



From the above properties, it follows easily how to calculate the area of triangles, and 
thus that of polygons. But what about non-polygonal sets? For example, how do we justify 
that the area of a circle of radius $r$ is $\pi r^2$?

Example 1.1.3 We use the properties of the area function to show that the area of an open$^1$
circle $A$ of radius $r$ is $\mathcal{A}(A) = \pi r^2$.

For $n = 1, 2, 3, \dots$, let $A_n$ be the regular open polygon with $2^{n+1}$ sides, inscribed in a circle
of radius $r$. Thus $A_1$ is a square, $A_2$ an octagon, etc. Note that $A_1 \subseteq A_2 \subseteq A_3 \subseteq \dots$. Also
note that $\bigcup_{n=1}^\infty A_n = A$. Each $A_n$ consists of $2^{n+1}$ congruent isosceles triangles, constructed
by joining each of the sides of the polygon to the centre of the circle. It is easy to see that each
such triangle has area $\frac12 r^2 \sin\frac{\pi}{2^n}$, and we may therefore conclude that $\mathcal{A}(A_n) = 2^n r^2 \sin\frac{\pi}{2^n}$.

We now show that $\mathcal{A}(A) = \lim_{n\to\infty} \mathcal{A}(A_n)$. This may be obvious intuitively, but requires
the countable additivity property stated above: Define a sequence of sets $B_1, B_2, \dots, B_n, \dots$
inductively by

$$B_1 := A_1, \qquad B_{n+1} := A_{n+1} - A_n$$

and note that the $B_n$ are mutually disjoint and that

$$A_n = \bigcup_{k=1}^n B_k, \qquad A = \bigcup_{n=1}^\infty B_n$$

1 If you don't know any topology, then intuitively a subset $A \subseteq \mathbb{R}^n$ is open when it doesn't contain any of
its boundary points.






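Then we have

$$\mathcal{A}(A) = \mathcal{A}\Big(\bigcup_{n=1}^\infty B_n\Big) = \sum_{n=1}^\infty \mathcal{A}(B_n) = \lim_{n\to\infty} \sum_{k=1}^n \mathcal{A}(B_k) = \lim_{n\to\infty} \mathcal{A}(A_n)$$

We therefore conclude that $\mathcal{A}(A) = \lim_{n\to\infty} \pi r^2 \Big( \frac{\sin(\pi/2^n)}{\pi/2^n} \Big) = \pi r^2$.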



□ 
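The convergence of the polygon areas in Example 1.1.3 is easy to observe numerically. This is only a floating-point illustration of the formula $\mathcal{A}(A_n) = 2^n r^2 \sin(\pi/2^n)$, not a substitute for the countable-additivity argument; the function name `polygon_area` is an assumption made for this sketch.

```python
import math


def polygon_area(n, r=1.0):
    """Area of the regular polygon with 2**(n+1) sides inscribed in a circle of radius r."""
    return 2**n * r**2 * math.sin(math.pi / 2**n)


# Areas of A_1, ..., A_20 for r = 1; they increase towards pi.
areas = [polygon_area(n) for n in range(1, 21)]
```

For $r = 1$ the first entry is the area of the inscribed square, and the later entries agree with $\pi$ to many decimal places.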



The technique used to compute the area in Example 1.1.3 relies on the set A being 
approximated from the inside by triangles. Not every set can be so approximated, however. 
Take for example the set $E := \{(p,q) : p, q \text{ are rational numbers with } 0 \le p, q \le 1\}$. It is
not clear what the area of E should be: On the one hand, the set E is dense inside the unit 
rectangle, so one might guess that A(E) = 1. If we approximate E from the outside however, 
we obtain a convincing argument that A(E) = 0: 

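Example 1.1.4 Recall that the set of rational numbers is countable. So we can write the
elements of $E$ in a list: $E = \{(p_n, q_n) : n \in \mathbb{N}\}$. Fix $\varepsilon > 0$. For $n \in \mathbb{N}$, let $R_n$ be a square
centred at $(p_n, q_n)$ with sides of length $\sqrt{\varepsilon/2^n}$, so that $\mathcal{A}(R_n) = \varepsilon/2^n$. Let $B := \bigcup_{n=1}^\infty R_n$.
Then clearly $\mathcal{A}(B) \le \sum_{n=1}^\infty \mathcal{A}(R_n) = \varepsilon$.
Furthermore, $E \subseteq B$ so that $\mathcal{A}(E) \le \varepsilon$. Since $\varepsilon > 0$ was arbitrary, we see that $\mathcal{A}(E) \le \varepsilon$ for
all $\varepsilon > 0$, and thus we must have $\mathcal{A}(E) = 0$.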

□ 
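The covering argument of Example 1.1.4 can be illustrated on a finite initial piece of the list. The sketch below assumes the $n$-th square is given area $\varepsilon/2^n$; the helper names and the denominator bound are hypothetical choices for the illustration.

```python
from fractions import Fraction


def rational_points(max_den):
    """Distinct points (p, q) with p, q rational in [0, 1] and denominators <= max_den."""
    vals = sorted({Fraction(a, d) for d in range(1, max_den + 1) for a in range(d + 1)})
    return [(p, q) for p in vals for q in vals]


def cover_area(num_points, eps):
    """Total area of squares R_1, ..., R_N, where R_n has area eps / 2**n."""
    return sum(eps / 2**n for n in range(1, num_points + 1))


points = rational_points(6)       # a finite initial segment of the list of E
eps = Fraction(1, 1000)
total = cover_area(len(points), eps)
```

However many points are listed, the total area of the covering squares stays strictly below `eps`, which is the heart of the argument.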



Observe that the technique used in Example 1.1.4 can be used to prove that every countable
subset of $\mathbb{R}^2$ has zero area. It is therefore necessary that the set of real numbers be
uncountable for the concept of area to have a useful meaning! Similar statements are true
about the notion of length.

Not only intervals or unions of intervals have length. Some quite complicated sets can be 
measured. Consider the following example: 



Exercise 1.1.5 The Cantor set: 

The Cantor set $C$ is a subset of $[0, 1]$ which is constructed as follows: Let $C_0 := [0, 1]$. Now
let $C_1$ be $C_0$ with its middle third removed, i.e. $C_1 := [0, \frac13] \cup [\frac23, 1]$. Now remove the middle
thirds of the two intervals that make up $C_1$ to form $C_2$, i.e. $C_2 := [0, \frac19] \cup [\frac29, \frac13] \cup [\frac23, \frac79] \cup [\frac89, 1]$.
Continue in this way, removing the middle thirds of each of the intervals comprising $C_n$ to
form $C_{n+1}$. Finally, let $C := \bigcap_{n=0}^\infty C_n$. $C$ is the Cantor set.

(a) Explain why $C$ is a Borel set.

(b) Show that $\lambda(C) = 0$. [Hint: First calculate $\lambda(C_n)$.]

(c) Every real number $a \in [0, 1]$ has a non-terminating ternary (base-3) expansion
$a = (0.a_1 a_2 a_3 \dots)_3 := \sum_{i=1}^\infty \frac{a_i}{3^i}$, where $a_i = 0$, 1 or 2 (cf. the remarks on base $n$-expansions in the
box below). Show that the Cantor set is formed by removing all numbers which have a 1
occurring in their non-terminating ternary expansion.

[Hint: Show that $C_1$ is formed by removing all numbers which have a 1 in the first digit,
$C_2$ is formed by removing all numbers in $C_1$ which have a 1 in the second digit, and so
on.]






(d) Show that there are as many elements in $C$ as there are real numbers. Conclude that $C$
is uncountable. [Hint: Define $f : C \to \mathbb{R}$ so that $f((0.202202\dots)_3) = (0.101101\dots)_2$.]

□ 
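The first few stages $C_n$ of the construction in Exercise 1.1.5 can be generated mechanically. The sketch below represents each $C_n$ as a list of closed intervals with exact endpoints; the function names are illustrative, and the computation is an aid to experimentation rather than a solution to the exercise.

```python
from fractions import Fraction


def cantor_step(intervals):
    """Remove the open middle third from each closed interval (a, b)."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))
        out.append((b - third, b))
    return out


def cantor_level(n):
    """Return C_n as a sorted list of closed intervals with Fraction endpoints."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        intervals = cantor_step(intervals)
    return intervals


def total_length(intervals):
    """Sum of the lengths of the (disjoint) intervals."""
    return sum(b - a for a, b in intervals)
```

Calling `total_length(cantor_level(n))` for a few values of `n` suggests a pattern that is useful for part (b).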

Base $n$-expansions: We are familiar with the fact that every real number has a base
10-expansion, i.e. a decimal expansion using digits in $\{0, 1, \dots, 9\}$, e.g.

$$\pi = 3.14159265\cdots = 3 + \frac{1}{10} + \frac{4}{10^2} + \frac{1}{10^3} + \frac{5}{10^4} + \cdots$$

In exactly the same way, every real number has a base $n$-expansion, for every integer
$n \ge 2$, using digits in $\{0, 1, \dots, n-1\}$. For example

$$(0.10110)_2 = \frac{1}{2} + \frac{0}{2^2} + \frac{1}{2^3} + \frac{1}{2^4} = \frac{11}{16} = (0.6875)_{10}$$
$$(0.2121212\dots)_3 = \frac{2}{3} + \frac{1}{3^2} + \frac{2}{3^3} + \frac{1}{3^4} + \cdots = \frac{7}{8} = (0.875)_{10}$$

Such an expansion is typically unique, except for one small problem:
$(0.(n-1)(n-1)(n-1)\cdots)_n = \sum_{k=1}^\infty \frac{n-1}{n^k} = \frac{n-1}{n} \cdot \frac{1}{1 - \frac1n} = 1 = (1.000\dots)_n$, i.e. every number whose base
$n$-expansion ends in zeroes also has one that ends in $(n-1)$'s. For example,

$$(0.9999\dots)_{10} = (1.0000\dots)_{10} \qquad (0.011111\dots)_2 = \frac12 = (0.1000\dots)_2$$
$$(0.102222\dots)_3 = \frac49 = (0.110000\dots)_3$$

Say that an expansion is terminating if it eventually ends in all zeroes. Then every real
number has a unique non-terminating base $n$-expansion.
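The digits of a base $n$-expansion can be computed by repeatedly multiplying by $n$ and peeling off the integer part. The following is a minimal sketch for $x \in [0, 1)$ using exact fractions; note that this greedy procedure produces the terminating expansion whenever one exists, and the names `digits` and `value` are assumptions made for the example.

```python
from fractions import Fraction


def digits(x, base, count):
    """First `count` digits after the point of the greedy base-`base` expansion of x in [0, 1)."""
    out = []
    for _ in range(count):
        x *= base
        d = int(x)          # the integer part is the next digit
        out.append(d)
        x -= d
    return out


def value(ds, base):
    """Value of the finite expansion (0.d_1 d_2 ... d_k) in the given base."""
    return sum(Fraction(d, base**k) for k, d in enumerate(ds, start=1))
```

For instance, `digits(Fraction(11, 16), 2, 5)` reproduces the binary example in the box, and `digits(Fraction(7, 8), 3, 6)` shows the repeating ternary pattern.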

Using an argument due to Vitali in 1905, we show that it is impossible to assign a length
to every bounded subset of $\mathbb{R}$, i.e. there is no function which satisfies each of the properties
(1)-(4) of $\mathcal{L}(\cdot)$ above, and which is defined for every bounded subset of $\mathbb{R}$. Thus there are
subsets of $\mathbb{R}$ which have no length. This does not mean that these sets have zero length; it
means that there is no number which can be called their length, and which is consistent with
(1)-(4).

Example 1.1.6 Define an equivalence relation $\sim$ on $\mathbb{R}$ by

$$x \sim y \iff y - x \in \mathbb{Q}$$

Let $\{E_i : i \in I\}$ be an enumeration of the equivalence classes of $\sim$. Note that if $x \in \mathbb{R}$, then
there exists $q \in \mathbb{Q}$ such that $0 \le x + q \le 1$. Now since $x \sim x + q$, we see that for every $x$
there is $y \in [0, 1]$ such that $x \sim y$. Thus $[0, 1] \cap E_i \ne \emptyset$ for every $i \in I$.

Now pick$^2$ for each $i \in I$ one $x_i \in [0, 1] \cap E_i$, and define a Vitali set $H$ by $H := \{x_i : i \in I\}$.
Thus for each $y \in \mathbb{R}$ there is a unique $i \in I$ such that $y \sim x_i$.

For $q \in \mathbb{Q}$, define $H + q := \{x_i + q : i \in I\}$. First note that the $H + q$ are mutually disjoint:
For if $y \in (H + q) \cap (H + q')$ for rational numbers $q, q'$, then $y = x_i + q = x_j + q'$ for some
$i, j \in I$, and thus $x_i = x_j + (q' - q)$, i.e. $x_i \sim x_j$. It follows that $x_i = x_j$, thus that $q = q'$,
and thus that $H + q = H + q'$.

Next, we claim that for each $y \in \mathbb{R}$ there is a unique $q \in \mathbb{Q}$ such that $y \in H + q$.
Indeed, existence follows from the fact that there is an $i \in I$ such that $y \sim x_i$, so that
$q := y - x_i$ has the property that $y \in H + q$. Uniqueness follows from the disjointness of the
$H + q$.

Now let $\{q_n : n \in \mathbb{N}\}$ be an enumeration of $\mathbb{Q} \cap [-1, 1]$. Note that if $x \in [0, 1]$, there is a
unique $i \in I$ such that $x - x_i \in \mathbb{Q} \cap [-1, 1]$, so that $x \in \bigcup_{n \in \mathbb{N}} (H + q_n)$. Since $H \subseteq [0, 1]$, we
also have $\bigcup_{n \in \mathbb{N}} (H + q_n) \subseteq [-1, 2]$. Thus:

$$[0, 1] \subseteq \bigcup_{n \in \mathbb{N}} (H + q_n) \subseteq [-1, 2]$$

Now suppose that the Vitali set $H$ has a length, i.e. that $\mathcal{L}(H)$ exists. Each $H + q_n$ is a
translation of $H$, and thus $\mathcal{L}(H + q_n) = \mathcal{L}(H)$ for all $n \in \mathbb{N}$. Now since
$[0, 1] \subseteq \bigcup_{n=1}^\infty (H + q_n) \subseteq [-1, 2]$, it follows that

$$1 \le \mathcal{L}\Big(\bigcup_{n=1}^\infty (H + q_n)\Big) \le 3$$

Furthermore, the sets $H + q_n$ are mutually disjoint, so

$$\mathcal{L}\Big(\bigcup_{n=1}^\infty (H + q_n)\Big) = \sum_{n=1}^\infty \mathcal{L}(H + q_n) = \sum_{n=1}^\infty \mathcal{L}(H)$$

The fact that $1 \le \sum_{n=1}^\infty \mathcal{L}(H)$ implies that $\mathcal{L}(H) > 0$, whereas the fact that
$\sum_{n=1}^\infty \mathcal{L}(H) \le 3 < \infty$ implies that $\mathcal{L}(H) \le 0$: a contradiction. Hence $H \subseteq \mathbb{R}$ is a bounded set to which a
length cannot be assigned.

2 This requires the Axiom of Choice.

□ 



Remarks 1.1.7 Just so, some subsets of $\mathbb{R}^2$ fail to have an area, and some subsets of $\mathbb{R}^3$
fail to have a volume. This is illustrated by the following theorem, called the Banach-Tarski
paradox: It is possible to break up a solid ball the size of a pea into finitely many pieces, and
to rearrange these pieces to form a solid ball the size of the sun (or into pretty much any
shape of any size that you desire).$^3$

□ 



Here are some points to consider: 

• To calculate the lengths/areas of sets more complicated than intervals/rectangles, it 
is necessary to approximate these sets from above and/or below, and to take limits. 
The countable additivity of the length/area function is what permits the taking of 
limits. 

• Not every subset of $\mathbb{R}$/$\mathbb{R}^2$ can be assigned a length/area. It will therefore be necessary
to exclude these non-measurable sets from consideration: They are not
"permissible".

3 No, really, this is a theorem. However, the proof of this theorem also depends on the Axiom of Choice,
an axiom about sets that most people find obvious, but that mathematicians of the early twentieth century
deemed highly controversial. This axiom states that if you have a family of non-empty disjoint sets $A_i$ ($i \in I$),
then there is a set $B$ which contains exactly one element from each $A_i$.






Now note that length and area may be considered as special cases of probability, namely 
uniform probability. 

Example 1.1.8 Consider the experiment of randomly choosing a number $\omega$ from the unit
interval $[0, 1]$ with equal probability.

• The probability that $0 \le \omega \le \frac12$ is $\frac12$. The event that $0 \le \omega \le \frac12$ corresponds to the set
$[0, \frac12]$. Its length is $\mathcal{L}([0, \frac12]) = \frac12$.

• Similarly, if $(a, b) \subseteq [0, 1]$, then the probability that $\omega \in (a, b)$ is $\mathbb{P}((a, b)) = \mathcal{L}((a, b)) =
b - a$.

• Thus if $E \subseteq [0, 1]$, then the probability that $\omega \in E$ is $\mathbb{P}(E) = \mathcal{L}(E)$.

In particular, if $H$ is the Vitali set of Example 1.1.6 then the probability that $\omega \in H$ is
undefined: it is impossible to meaningfully assign a probability to $H$ while maintaining
translation invariance. $H$ is not "permissible".

□ 



Similarly, consider the experiment of choosing a point $\omega$ from the unit square $[0, 1] \times [0, 1]$
with equal probability. If $E \subseteq [0, 1] \times [0, 1]$, then the probability that $\omega \in E$ is just $\mathcal{A}(E)$.



1.2 Events 

1.2.1 Structure of Events 

In order for our mathematical theory of probability to bear some resemblance to the real 
world, it is clear that we should be able to combine events in the following ways: 

• If $A$ is an event, then the possibility of $A$ not occurring should also be an event. Now
if the outcome of a random experiment is $\omega \in \Omega$, then the event $A$ occurs if and only if
$\omega \in A$ (remember that we consider an event to be a subset of the sample space). Thus
the event that $A$ does not occur corresponds to $\omega \notin A$, i.e. to the set $A^c = \Omega - A$.

• If $A, B$ are events, then the possibility of both $A$ and $B$ occurring should also be an
event. Now if the outcome of a random experiment is $\omega \in \Omega$, then both $A$ and $B$ occur
if and only if $\omega \in A$ and $\omega \in B$, i.e. if and only if $\omega \in A \cap B$. Thus the event of both $A$
and $B$ occurring corresponds to the set $A \cap B$.

• If $A$ and $B$ are events, then the possibility of at least one of $A$ or $B$ occurring should be
an event as well. This corresponds to the set $A \cup B$. We say that events are disjoint or
mutually exclusive if they cannot occur simultaneously. Thus if $A, B$ are disjoint, then
$\omega \in A$ implies $\omega \notin B$. Clearly, therefore, $A$ and $B$ are disjoint if and only if $A \cap B = \emptyset$
(i.e. the event that both $A$ and $B$ occur is impossible). For disjoint events $A$ and $B$,
we want $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)$.

• In fact we demand more: Given a countable list of events $A_1, A_2, A_3, \dots$, the possibilities
of either all of these events occurring, or of at least one of these events occurring, should
also be events. They correspond to the sets $\bigcap_{n=1}^\infty A_n$ and $\bigcup_{n=1}^\infty A_n$ respectively. If the
events $A_n$ are mutually disjoint, i.e. if $A_n \cap A_m = \emptyset$ whenever $n \ne m$, then we want

$$\mathbb{P}\Big(\bigcup_{n=1}^\infty A_n\Big) = \sum_{n=1}^\infty \mathbb{P}(A_n)$$

Note that since $\mathbb{P}(\Omega) = 1$, we have $\mathbb{P}(A \cup A^c) = 1$. By disjointness, $\mathbb{P}(A \cup A^c) = \mathbb{P}(A) +
\mathbb{P}(A^c)$, and hence $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$.

As you will note, the concept of probability has rather a lot in common with those of length 
and area. For example, 

• "Probability" is measured by a non-negative real number $\mathbb{P}(E)$ assigned to a subset $E$ of
the sample space $\Omega$.

"Area" is measured by a non-negative real number $\mathcal{A}(E)$ assigned to a subset $E$ of $\mathbb{R}^2$.

• $\mathbb{P}(\emptyset) = 0$;
$\mathcal{A}(\emptyset) = 0$.

• If $E_n$ are disjoint events, then $\mathbb{P}(\bigcup_n E_n) = \sum_n \mathbb{P}(E_n)$;

If $E_n$ are disjoint subsets of $\mathbb{R}^2$, then $\mathcal{A}(\bigcup_n E_n) = \sum_n \mathcal{A}(E_n)$.

When we isolate and study the common features of probability and area, we get the 
subject of measure theory. We shall show that we can develop a theory which allows us to 
form integrals $\int f \, d\mu$ of functions $f$ with respect to measures (rather than variables). It
will turn out that the integral with respect to Lebesgue measure (yet to be defined) will be 
a powerful generalization of the ordinary Riemann integral. It will also transpire that the 
integral with respect to a probability measure precisely captures the notion of probabilistic 
expectation. 

Armed with the intuition and motivation provided by the above discussion and examples, 
we may now proceed with the formal theory. 

1.2.2 Events and $\sigma$-algebras

To model a random experiment, we need to define three objects: 

• A sample space $\Omega$, representing the possible outcomes of the experiment. The outcomes
$\omega \in \Omega$ are called sample points.

• A family $\mathcal{F}$ of events.

An event is a (permissible/relevant) subset of $\Omega$. If $A$ is an event, we say that $A$ occurs
if the outcome $\omega$ is an element of $A$.

We shall require $\mathcal{F}$ to be a $\sigma$-algebra (which we define below).

• A probability measure $\mathbb{P}$ which assigns to each event a probability.
Thus $\mathbb{P} : \mathcal{F} \to [0, 1]$.

The triple $(\Omega, \mathcal{F}, \mathbb{P})$ will be called a probability space (subject to certain conditions on $\mathcal{F}$ and
$\mathbb{P}$).

Recall that an event $E \in \mathcal{F}$ is said to have occurred if the outcome $\omega$ of the random
experiment belongs to $E$. Intuitively, we think of $\mathcal{F}$ as a set of events $E$ for which we can
decide whether or not $E$ occurred at the termination of the experiment. Note: ...whether
or not....

This intuition imposes the following constraints on $\mathcal{F}$:






(a) $\Omega \in \mathcal{F}$ and $\emptyset \in \mathcal{F}$.

Indeed, every outcome $\omega$ belongs to $\Omega$, and thus the event $\Omega$ always occurs: it's the
certain event.

Similarly, no outcome $\omega$ belongs to $\emptyset$, and thus the event $\emptyset$ never occurs: it's the
impossible event.

(b) If $E \in \mathcal{F}$, then $E^c \in \mathcal{F}$, i.e. $\mathcal{F}$ is closed under complementation.

For if we can decide whether or not $E$ occurred, then we can also decide whether or not
$E^c$ occurred: For suppose that the outcome of the experiment is $\omega$. If $E$ occurred, then
$\omega \in E$, so $\omega \notin E^c$, hence $E^c$ did not occur.
Similarly, if $E$ did not occur, then $E^c$ did occur.

(c) If $E_1, E_2 \in \mathcal{F}$, then $E_1 \cap E_2 \in \mathcal{F}$, i.e. $\mathcal{F}$ is closed under intersection.

For if we can decide whether or not $E_1$ occurred, and also whether or not $E_2$ occurred,
then we can decide whether or not $E_1 \cap E_2$ occurred: $E_1 \cap E_2$ occurred iff $\omega \in E_1 \cap E_2$
iff $\omega \in E_1$ and $\omega \in E_2$ iff both $E_1$ and $E_2$ occurred.

Thus if we can decide whether or not $E_1, E_2$ occurred, we can also decide whether or not
$E_1 \cap E_2$ occurred.

(d) Similarly, $\mathcal{F}$ is closed under union: The event $E_1 \cup E_2$ occurs iff either $E_1$ occurred, or
$E_2$ occurred (or both).

(e) We can generalize (c) and (d) somewhat: If $E_1, E_2, E_3, \dots, E_n, \dots$ is a countable sequence
of members of $\mathcal{F}$, then also $\bigcap_n E_n \in \mathcal{F}$ and $\bigcup_n E_n \in \mathcal{F}$, i.e. $\mathcal{F}$ is closed under countable
intersections and unions.

For $\bigcap_n E_n$ occurred iff each of the $E_n$ occurred, and $\bigcup_n E_n$ occurred iff at least one of
the $E_n$ occurred. Thus if we can decide whether or not each $E_n$ occurred, we can also
decide whether or not $\bigcap_n E_n$ and $\bigcup_n E_n$ occurred.

This leads to the following definitions: 

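Definition 1.2.1 Let $\Omega$ be a set. A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra (or
field) on $\Omega$ if

(i) $\Omega \in \mathcal{A}$;

(ii) $A \in \mathcal{A} \Rightarrow A^c \in \mathcal{A}$;

(iii) $A, B \in \mathcal{A} \Rightarrow A \cup B \in \mathcal{A}$.

A collection $\mathcal{F}$ of subsets of $\Omega$ is called a $\sigma$-algebra (or $\sigma$-field) if it satisfies (i), (ii) and
(iii)$_\sigma$ If $A_n \in \mathcal{F}$ (for $n \in \mathbb{N}$), then $\bigcup_n A_n \in \mathcal{F}$.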
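On a finite set, the conditions of Definition 1.2.1 can be checked exhaustively, and there an algebra is automatically a $\sigma$-algebra because only finitely many distinct unions exist. The sketch below is one way to write such a check; the function name `is_algebra` is an assumption made for illustration.

```python
from itertools import combinations


def is_algebra(omega, family):
    """Check conditions (i)-(iii) of Definition 1.2.1 for a family of subsets of a finite set omega."""
    omega = frozenset(omega)
    fam = {frozenset(s) for s in family}
    if omega not in fam:                          # (i) the whole space belongs to the family
        return False
    if any(omega - a not in fam for a in fam):    # (ii) closed under complements
        return False
    # (iii) closed under pairwise (hence finite) unions
    return all(a | b in fam for a, b in combinations(fam, 2))
```

For example, the family consisting of the empty set, `{1, 2}`, `{3, 4}` and `{1, 2, 3, 4}` passes the check, while a family that contains `{1}` but not its complement fails at step (ii).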

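Remarks 1.2.2 (i) An algebra is closed under (finite) intersections, and a $\sigma$-algebra is
closed under countable intersections: by De Morgan's laws, $\bigcap_n A_n = \big(\bigcup_n A_n^c\big)^c$.

(ii) If $\mathcal{A}$ is an algebra and $A, B \in \mathcal{A}$, then $A - B$ and $A \triangle B$ belong to $\mathcal{A}$.

(iii) If $\Omega$ is a set, then $\{\emptyset, \Omega\}$ is the smallest $\sigma$-algebra on $\Omega$, and $\mathcal{P}(\Omega)$ is the
biggest $\sigma$-algebra on $\Omega$.

(iv) If $\{\mathcal{F}_i : i \in I\}$ is a family of $\sigma$-algebras on $\Omega$, then $\mathcal{F} := \bigcap_{i \in I} \mathcal{F}_i$ is also a $\sigma$-algebra on $\Omega$.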

□ 

Events are organized in $\sigma$-algebras. The set-theoretic operations $\bigcup$, $\bigcap$, $\cdot^c$ correspond
to logical combinations or, and, not of events.

Frequently, the events of interest form a collection $\mathcal{C}$ which is not a $\sigma$-algebra. Suppose
that $\mathcal{C}$ is a collection of events which can be decided, i.e. if $E \in \mathcal{C}$, then we can decide whether
or not $E$ occurred. We can then also decide whether or not $E^c$ occurred, but $E^c$ may not be
an element of $\mathcal{C}$. The bigger set $\mathcal{F}$ of all events that can be decided, given that we can decide
all the events in $\mathcal{C}$, is a $\sigma$-algebra containing $\mathcal{C}$.

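Definition and Proposition 1.2.3 Let $\mathcal{C}$ be a family of subsets of $\Omega$. There exists a
unique smallest $\sigma$-algebra $\mathcal{F}$ which contains $\mathcal{C}$ (i.e. $\mathcal{C} \subseteq \mathcal{F}$, and if $\mathcal{G}$ is any $\sigma$-algebra such
that $\mathcal{C} \subseteq \mathcal{G}$, then $\mathcal{F} \subseteq \mathcal{G}$ also).

$\mathcal{F}$ is called the $\sigma$-algebra generated by $\mathcal{C}$, and denoted by

$$\mathcal{F} = \sigma(\mathcal{C})$$

Proof: Let $\mathbb{F} = \{\mathcal{G} : \mathcal{G} \text{ a } \sigma\text{-algebra with } \mathcal{C} \subseteq \mathcal{G}\}$, and let $\mathcal{F} = \bigcap \mathbb{F}$. Then $\mathcal{F} \in \mathbb{F}$. (Why?)
Moreover, if $\mathcal{G}$ is a $\sigma$-algebra which contains $\mathcal{C}$, then $\mathcal{G} \in \mathbb{F}$, and so $\mathcal{F} \subseteq \mathcal{G}$. (Why?)

⊣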

We repeat the following important intuition:

$\sigma(\mathcal{C})$ consists of all those events $F$ for which we can decide whether or not $F$ has occurred,
given that we know exactly which of the $E \in \mathcal{C}$ have occurred.
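When $\Omega$ is finite, $\sigma(\mathcal{C})$ can be computed by brute force: start from the generators and keep adding complements and unions until nothing new appears. This is a sketch under the assumption that $\Omega$ is a small finite set; it does not mirror the intersection-based proof above, and the function name is hypothetical.

```python
def generated_sigma_algebra(omega, generators):
    """Smallest family containing the generators that is closed under complement and union."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:                      # repeat until a full pass adds nothing new
        changed = False
        current = list(family)
        for a in current:
            candidates = [omega - a] + [a | b for b in current]
            for c in candidates:
                if c not in family:
                    family.add(c)
                    changed = True
    return family


sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
```

Here knowing whether `{1}` and `{1, 2}` occurred lets us decide `{2}` and `{3, 4}` as well, but not `{3}` on its own, which matches the intuition about decidable events.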

Exercise 1.2.4 Let $\Omega = (0, 1]$, and let $\mathcal{A}$ be the family of all those sets which can be written
as a union of finitely many intervals of the form $(a, b]$, where $0 \le a \le b \le 1$. Show that $\mathcal{A}$ is
an algebra, but not a $\sigma$-algebra.

□ 

Exercise 1.2.5 Let $\Omega$ be a set. A partition of $\Omega$ is a collection of pairwise disjoint subsets
$\{B_i : i \in I\}$ whose union is $\Omega$, i.e.

$$B_i \cap B_j = \emptyset \text{ when } i \ne j, \qquad \bigcup_{i \in I} B_i = \Omega$$

The $B_i$ are called the blocks of the partition.

(a) Suppose that $\Omega$ is a set, and that $\mathcal{B} := \{B_n : n \in \mathbb{N}\}$ is a partition of $\Omega$ with countably
many blocks. Show that the $\sigma$-algebra $\sigma(\mathcal{B})$ generated by $\mathcal{B}$ is precisely the family of
those sets that can be written as countable unions of the blocks $B_n$.
[Hint: Consider union and complements of $B_1 \cup B_4 \cup B_9 \cup \dots$ and $B_2 \cup B_3 \cup B_5 \cup B_7 \cup
B_{11} \cup \dots$.]

(b) Show that if $\Omega$ is a countable set, and if $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, then there is a countable
partition $\mathcal{B} := \{B_n : n \in \mathbb{N}\}$ which generates $\mathcal{F}$.

[Hint: Define a relation $\sim$ on $\Omega$ by

$$\omega \sim \omega' \iff \forall F \in \mathcal{F}\, [\omega \in F \iff \omega' \in F]$$

Show that $\sim$ is an equivalence relation, and consider the equivalence classes of $\sim$.]

(c) In a gambling game, a die is rolled. The sample space is $\Omega = \{1, 2, \dots, 6\}$. I will tell you
whether or not the outcome is even, and whether or not the outcome is < 4. What is the
$\sigma$-algebra on $\Omega$ which contains this information?

(d) Suppose that $\mathcal{F} = \sigma(\mathcal{B})$, where $\mathcal{B}$ is a partition consisting of $n$ blocks. Explain why $\mathcal{F}$
has exactly $2^n = 2^{\text{no. of blocks}}$ elements.

□ 



Definition 1.2.6 If the sample space $\Omega$ is a topological space, we define the Borel algebra
of $\Omega$ by

$$\mathcal{B}(\Omega) = \sigma(\text{open sets of } \Omega)$$

In particular, $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra on $\mathbb{R}$ which contains all the open subsets of
$\mathbb{R}$.

$\mathcal{B}(\mathbb{R})$ is one of the most important $\sigma$-algebras.

Exercise 1.2.7 Prove that the following sets belong to $\mathcal{B}(\mathbb{R})$:

(i) All closed intervals $[x, y]$, where $x \le y \in \mathbb{R}$.

(ii) The half-open intervals $(x, y]$ and $[x, y)$, where $x \le y \in \mathbb{R}$.

(iii) Every singleton $\{x\}$, where $x \in \mathbb{R}$.

(iv) Every countable subset of $\mathbb{R}$.

(v) The sets $\mathbb{Q}$ of rational numbers and Irr of irrational numbers.

(vi) The Cantor set $C$ described in Exercise 1.1.5.



□ 



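Proposition 1.2.8 Define

$\mathcal{C}$ = collection of all intervals of the form $(-\infty, x]$, where $x \in \mathbb{R}$.
Then $\sigma(\mathcal{C}) = \mathcal{B}(\mathbb{R})$.

Proof: As $(-\infty, x] = \bigcap_{n=1}^\infty (-\infty, x + \frac1n)$, we see that $\mathcal{C} \subseteq \mathcal{B}(\mathbb{R})$, and thus $\sigma(\mathcal{C}) \subseteq \mathcal{B}(\mathbb{R})$. (Why?)

Conversely, suppose first that $I = (a, b)$ is an open interval. Then $I = (-\infty, a]^c \cap
\bigcup_n (-\infty, b - \frac1n]$, so $I \in \sigma(\mathcal{C})$, i.e. $\sigma(\mathcal{C})$ contains every open interval.

Now since every open subset of $\mathbb{R}$ can be represented as a union of countably many open
intervals,$^4$ we see that $\sigma(\mathcal{C})$ contains every open subset of $\mathbb{R}$. Thus $\mathcal{B}(\mathbb{R}) \subseteq \sigma(\mathcal{C})$.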



4 Let $U \subseteq \mathbb{R}$ be a bounded open set, and let $\{q_n : n \in \mathbb{N}\}$ enumerate the rationals in $U$. Define $r_n :=
\sup\{r : (q_n - r, q_n + r) \subseteq U\}$, and let $I_n = (q_n - r_n, q_n + r_n)$. Then $\bigcup_n I_n \subseteq U$. Conversely, if $x \in U$,






1.3 Measures 

The notion of measure generalizes the concepts of length, area, volume, mass, and probability. 



Definition 1.3.1 Let $\mathcal{F}$ be a $\sigma$-algebra on a set $\Omega$. A function $\mu : \mathcal{F} \to \overline{\mathbb{R}}$ is called a
(countably additive, non-negative) measure if and only if

(i) $\mu$ is non-negative: $0 \le \mu A \le \infty$ for each $A \in \mathcal{F}$.

(ii) $\mu \emptyset = 0$.

(iii) $\mu$ is countably additive (or $\sigma$-additive): If $A_1, A_2, \dots \in \mathcal{F}$ is a countable sequence of
pairwise disjoint sets, then

$$\mu\Big(\bigcup_n A_n\Big) = \sum_n \mu A_n$$

If $\mu \Omega = 1$, then $\mu$ is called a probability measure.



If $\mathcal{F}$ is a $\sigma$-algebra on a set $\Omega$, then the pair $(\Omega, \mathcal{F})$ is called a measurable space. The
elements of $\mathcal{F}$ are called measurable sets, or events in the probabilistic framework. If, in
addition, $\mu$ is a measure on $\mathcal{F}$, the triple $(\Omega, \mathcal{F}, \mu)$ is called a measure space. The symbols
$\mathbb{P}, \mathbb{Q}$ are used for probability measures, and $(\Omega, \mathcal{F}, \mathbb{P})$ will always denote a probability space.

Example 1.3.2 Important: Lebesgue Measure: We shall soon prove that there is a
unique measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, which assigns every interval its length, i.e.

$$\lambda(a, b) = \lambda(a, b] = \lambda[a, b] = b - a$$

This measure is called Lebesgue measure, and provided the original impetus for the development
of the subject of measure theory.

Example 1.1.8 makes clear that Lebesgue measure is also important in probability theory:
Consider the experiment of drawing a uniformly distributed random number from the unit
interval $[0, 1]$. The probability of drawing a number between $a$ and $b$ (where $0 \le a \le b \le 1$)
is $\mathbb{P}([a, b]) = b - a$. Thus the appropriate measure $\mathbb{P}$ is just Lebesgue measure, restricted to
$[0, 1]$.

There are higher dimensional analogues of Lebesgue measure: There is a measure, also
denoted $\lambda$ and called Lebesgue measure, on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, which assigns to every $n$-dimensional
rectangle its volume.

□ 



Example 1.3.3 Let $\Omega$ be a set. For $A \subseteq \Omega$, define $|A|$ = no. of elements of $A$. Then $|\cdot|$
defines a measure on $(\Omega, \mathcal{P}(\Omega))$, called counting measure.

□ 

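there is $\varepsilon > 0$ so that $(x - \varepsilon, x + \varepsilon) \subseteq U$. Choose $q_n$ such that $|x - q_n| < \frac{\varepsilon}{2}$. Then if $|y - q_n| < \frac{\varepsilon}{2}$, we have
$|x - y| \le |x - q_n| + |q_n - y| < \varepsilon$, and thus $x \in (q_n - \frac{\varepsilon}{2}, q_n + \frac{\varepsilon}{2}) \subseteq (x - \varepsilon, x + \varepsilon) \subseteq U$. It follows that $r_n \ge \frac{\varepsilon}{2}$,
and thus that $x \in I_n$. Hence $U \subseteq \bigcup_n I_n$. If $U$ is not bounded, note that $U = \bigcup_n (U \cap (-n, n))$ is a union of
bounded open sets.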






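Exercise 1.3.4 Let $(X, \mathcal{F})$ be a measurable space, and let $x_0 \in X$. Define $\delta_{x_0} : \mathcal{F} \to \mathbb{R}$ by

$$\delta_{x_0}(F) = \begin{cases} 1 & \text{if } x_0 \in F \\ 0 & \text{if } x_0 \notin F \end{cases}$$

Show that $\delta_{x_0}$ is a measure on $(X, \mathcal{F})$.

$\delta_{x_0}$ is called the Dirac measure, or point mass, at $x_0$.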

□ 

Example 1.3.5 Suppose that $F : \mathbb{R} \to \mathbb{R}$ is an increasing right-continuous function, i.e.

$$F(s) \le F(t) \text{ when } s \le t \qquad \text{and} \qquad F(t) = \lim_{s \downarrow t} F(s)$$

We shall prove later that there is a unique measure $\mu_F$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with the property that

$$\mu_F(a, b] = F(b) - F(a) \quad \text{for all } a \le b \in \mathbb{R}$$

$\mu_F$ is called the Lebesgue-Stieltjes measure associated with $F$. Note that if $F(t) := t$, then
$\mu_F = \lambda$ is Lebesgue measure.

□ 

Note that, for general measures, we allow $+\infty$ as a value. For example, the length of the
real line is $+\infty$, so $\lambda(\mathbb{R}) = +\infty$, where $\lambda$ is Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. However, we often
need to get a "handle" on infinity:



Definition 1.3.6 A measure $\mu$ on a measurable space $(\Omega, \mathcal{F})$ is called

(a) finite, if $\mu\Omega < \infty$;

(b) $\sigma$-finite, if $\Omega$ is the countable union of sets of finite measure, i.e. if there is a sequence
$A_1, A_2, \dots$ of measurable sets such that each $\mu A_n < \infty$, and such that $\Omega = \bigcup_n A_n$.



Exercise 1.3.7 (a) Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is $\sigma$-finite, but not finite.
(b) Counting measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is not $\sigma$-finite.

□ 

The following exercise is often useful: 

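Exercise 1.3.8 Suppose that $(\Omega, \mathcal{F}, \mu)$ is a measure space, and that $A \in \mathcal{F}$. Define

$$\mathcal{F} \cap A = \{F \cap A : F \in \mathcal{F}\}$$

(this is an abuse of notation), and let $\mu_A = \mu|_{\mathcal{F} \cap A}$. Then $(A, \mathcal{F} \cap A, \mu_A)$ is a measure space
also: the restriction of $(\Omega, \mathcal{F}, \mu)$ to $A$.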

□ 






1.4 Properties of Measures 

1.4.1 Additivity properties 
Proposition 1.4.1 (Additivity Properties)

Suppose that $(\Omega, \mathcal{F}, \mu)$ is a measure space, and let $A, B, A_1, A_2, \dots \in \mathcal{F}$.

(a) If $A \subseteq B$, then $\mu A \le \mu B$.

(b) If $A \subseteq B$ and $\mu A < \infty$, then $\mu(B - A) = \mu B - \mu A$.

(c) $\mu(A \cup B) = \mu A + \mu B - \mu(A \cap B)$

(d) Countable Subadditivity: $\mu(\bigcup_n A_n) \le \sum_n \mu A_n$.

Proof: Suppose that $A \subseteq B$. Then $B = A \cup (B - A)$ is a union of disjoint sets, and hence
$\mu B = \mu A + \mu(B - A)$. It follows that $\mu B \ge \mu A$. Furthermore, if $\mu A < \infty$, then we obtain
$\mu B - \mu A = \mu(B - A)$, proving (a), (b).

(c) follows from the fact that $A \cup B$ can be written as a disjoint union $A \cup B = A \cup (B - A \cap B)$,
so that $\mu(A \cup B) = \mu A + \mu(B - A \cap B) = \mu A + \mu B - \mu(A \cap B)$.

(d): Define $B_n$ inductively by

$$B_1 := A_1 \qquad B_{n+1} := A_{n+1} - \bigcup_{k \le n} A_k$$

Then the $B_n$ are pairwise disjoint and $B_n \subseteq A_n$ for all $n$, so that $\mu B_n \le \mu A_n$. Moreover, $\bigcup_{k \le n} A_k = \bigcup_{k \le n} B_k$ for all $n$, and
$\bigcup_n A_n = \bigcup_n B_n$. Hence

$$\mu\Big(\bigcup_n A_n\Big) = \mu\Big(\bigcup_n B_n\Big) = \sum_n \mu B_n \le \sum_n \mu A_n$$

⊣
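The "make the sets disjoint" step used for (d) is easy to try out with counting measure on finite sets. The sketch below is only an illustration: the function name `disjointify` and the sample sets are assumptions made for this example, and each new set has everything seen so far removed from it.

```python
def disjointify(sets):
    """Return B_1, B_2, ... where B_n is A_n minus the union of the earlier A_k."""
    seen, pieces = set(), []
    for a in sets:
        pieces.append(set(a) - seen)   # new points contributed by this set
        seen |= set(a)
    return pieces

mu = len  # counting measure on finite sets (Example 1.3.3)

A = [{1, 2, 3}, {3, 4}, {1, 4, 5}, {6}]
B = disjointify(A)
union = set().union(*A)

# (c): inclusion-exclusion for the first two sets
lhs = mu(A[0] | A[1])
rhs = mu(A[0]) + mu(A[1]) - mu(A[0] & A[1])

# (d): the disjoint pieces add up exactly, the original sets only give a bound
exact = sum(mu(b) for b in B)
bound = sum(mu(a) for a in A)
```

Here `exact` equals the measure of the union, while `bound` overcounts the points that lie in several of the sets.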

Exercise 1.4.2 Suppose that $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, and that $A, A_1, A_2, \dots \in \mathcal{F}$.
Show that if $\mathbb{P}A_n = 1$ for $n \in \mathbb{N}$, then $\mathbb{P}(\bigcap_n A_n) = 1$ also.

□ 

Next, we introduce some useful terminology: 



Definition 1.4.3 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let $A \subseteq \Omega$.

(a) We say that $A$ is $\mu$-null if there exists $B \in \mathcal{F}$ such that $A \subseteq B$ and $\mu B = 0$.

(b) We shall say that a statement $\varphi$ holds $\mu$-almost everywhere (or $\mu$-almost surely in the
probabilistic framework), if the set of $\omega \in \Omega$ where $\varphi$ fails to hold is $\mu$-null.



We abbreviate $\mu$-almost everywhere and $\mu$-almost surely by $\mu$-a.e. and $\mu$-a.s. respectively.






Remarks 1.4.4 Note that in (a) above, the set $A$ might not belong to $\mathcal{F}$, so $\mu A$ might be
undefined. However, $\mu A$ "ought" to be zero. Later, this insight will allow us to extend
measures to $\sigma$-algebras larger than the ones we start off with.

As an example of (b), consider the reals with Lebesgue measure: $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$. Every point
is $\lambda$-null, since $\lambda\{x\} = \lambda[x, x] = x - x = 0$. Hence the set $\mathbb{Q}$ of rational numbers is $\lambda$-null: The
set $\mathbb{Q}$ is countable, and has an enumeration $\mathbb{Q} = \{q_n : n \in \mathbb{N}\}$. By countable additivity,

$$\lambda\mathbb{Q} = \sum_n \lambda\{q_n\} = 0$$

If the functions $f, g : \mathbb{R} \to \mathbb{R}$ are defined by

$$f := 0 \qquad g(x) := I_{\mathbb{Q}}(x) := \begin{cases} 1 & \text{if } x \in \mathbb{Q} \\ 0 & \text{else} \end{cases}$$

then $f, g$ are equal $\lambda$-almost everywhere.

□ 
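A complementary way to see that a countable set is "small" is to cover its $n$-th point by an interval of length $\varepsilon 2^{-n}$, so that the total length stays below $\varepsilon$. This covering argument is not the one used in the remark above (which relies on countable additivity); it anticipates the outer-measure construction of Section 1.5. The sketch below uses exact rational arithmetic, and the enumeration order and the function names are illustrative assumptions.

```python
from fractions import Fraction

def rationals_in_unit_interval(count):
    """First `count` distinct rationals in [0, 1], enumerated by denominator."""
    seen, out, q = set(), [], 1
    while len(out) < count:
        for p in range(q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.add(r)
                out.append(r)
                if len(out) == count:
                    break
        q += 1
    return out

def cover(points, eps):
    """Cover the n-th point (n = 1, 2, ...) by an open interval of length eps / 2**n."""
    return [(x - eps / 2 ** (n + 1), x + eps / 2 ** (n + 1))
            for n, x in enumerate(points, start=1)]

eps = Fraction(1, 1000)
points = rationals_in_unit_interval(50)
intervals = cover(points, eps)
total_length = sum(b - a for a, b in intervals)
```

However many points are covered, `total_length` stays strictly below `eps`, which can be chosen as small as we like.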

Exercise 1.4.5 (a) Let $\mathcal{N}$ denote the family of all $\mu$-null sets. Prove that $\mathcal{N}$ is closed under
countable unions.

(b) Prove that if $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and if $A_n \in \mathcal{F}$ are such that $\mathbb{P}(A_n) = 1$ for
all $n \in \mathbb{N}$, then $\mathbb{P}(\bigcap_n A_n) = 1$ also.

□ 



1.4.2 Limit Operations on Sets 



Definition 1.4.6 Let $(A_n)_{n \in \mathbb{N}}$ be a sequence of sets.

• We say that $(A_n)_n$ is increasing if $A_1 \subseteq A_2 \subseteq A_3 \subseteq \dots$. We say that $A_n \uparrow A$ if
$(A_n)_n$ is increasing, and $\bigcup_n A_n = A$.

• Similarly, we say that $(A_n)_n$ is decreasing if $A_1 \supseteq A_2 \supseteq A_3 \supseteq \dots$. We say that
$A_n \downarrow A$ if $(A_n)_n$ is decreasing, and $\bigcap_n A_n = A$.

• We define the limit superior of the sequence $(A_n)_n$ by

$$\limsup_n A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$$

• We define the limit inferior of the sequence $(A_n)_n$ by

$$\liminf_n A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k$$






Also note the following simple interpretations of the above limit operations:

$$\begin{aligned} x \in \limsup_n A_n &\iff \forall n \; \Big[x \in \bigcup_{k=n}^{\infty} A_k\Big] \\ &\iff \forall n \; \exists k \ge n \; [x \in A_k] \\ &\iff x \text{ belongs to infinitely many of the sets } A_k \end{aligned}$$

Similarly,

$$\begin{aligned} x \in \liminf_n A_n &\iff \exists n \; \Big[x \in \bigcap_{k=n}^{\infty} A_k\Big] \\ &\iff \exists n \; \forall k \ge n \; [x \in A_k] \\ &\iff x \text{ belongs to all the } A_k \text{ from some } n \text{ onwards} \end{aligned}$$

In particular, $x \in \liminf_n A_n$ iff $x$ belongs to all but finitely many of the $A_n$.
Thus we also write:

$$(A_n, \text{i.o.}) = \limsup_n A_n \qquad (A_n, \text{ev.}) = \liminf_n A_n$$

where i.o. means infinitely often, and ev. means eventually.

Thus $x \in (A_n, \text{i.o.})$ iff $x$ belongs to infinitely many of the sets $A_n$, etc.
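For a sequence of sets that repeats periodically, the two limits can be computed directly, which makes the "infinitely often" and "eventually" readings concrete. The sketch below is a minimal illustration under that periodicity assumption; the function names and the sample period are made up for the example.

```python
def limsup_periodic(period):
    """limsup of the sequence repeating `period` forever: points hit infinitely often."""
    return set().union(*period)

def liminf_periodic(period):
    """liminf of the same sequence: points lying in all but finitely many terms."""
    return set.intersection(*(set(a) for a in period))

period = [{0, 1}, {1, 2}, {1, 3}]

def A(k):
    """The k-th set of the periodic sequence (k = 0, 1, 2, ...)."""
    return period[k % len(period)]

# Definition-based check: because the sequence repeats, the union (intersection)
# over any window of one full period already equals the whole tail union
# (tail intersection) starting at n.
tail_unions = [set().union(*(A(k) for k in range(n, n + len(period))))
               for n in range(6)]
tail_inters = [set.intersection(*(A(k) for k in range(n, n + len(period))))
               for n in range(6)]
```

In this example the point 1 lies in every set, so it is the only element of the limit inferior, while every point occurs infinitely often and so lies in the limit superior.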



Proposition 1.4.7 (a) $\liminf_n A_n \subseteq \limsup_n A_n$
(b) $(\limsup_n A_n)^c = \liminf_n A_n^c$, $\quad (\liminf_n A_n)^c = \limsup_n A_n^c$



Exercise 1.4.8 Prove Proposition 1.4.7 



□ 

Remarks 1.4.9 We recall the definitions of $\limsup_n$, $\liminf_n$ of sequences of real numbers,
which we shall need in the next section.

Recall that if $(x_n)$ is a sequence of real numbers, then

$$\limsup_n x_n := \lim_{n \to \infty} \sup_{k \ge n} x_k \qquad \liminf_n x_n := \lim_{n \to \infty} \inf_{k \ge n} x_k$$

If we define $y_n := \sup_{k \ge n} x_k$, then $(y_n)$ is clearly a decreasing sequence in $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$
and hence convergent (possibly to $\pm\infty$), and $\limsup_n x_n = \lim_n y_n$. Similarly, if we define
$z_n := \inf_{k \ge n} x_k$, then $(z_n)$ is an increasing sequence in $\overline{\mathbb{R}}$, and thus convergent, and
$\liminf_n x_n = \lim_n z_n$.

The $\limsup_n$ and $\liminf_n$ of a sequence always exist (also allowing $\pm\infty$), even if the limit
does not. It is easy to verify that $\lim_n x_n$ exists if and only if $\limsup_n x_n = \liminf_n x_n$, in
which case $\lim_n x_n = \limsup_n x_n = \liminf_n x_n$.

Further note that if $\limsup_n a_n > b$, then $a_n > b$ for infinitely many $n$. Conversely, if
$a_n > b$ for infinitely many $n$, then $\limsup_n a_n \ge b$. Similarly, if $b < \liminf_n a_n$, then there is
$n$ such that $b < a_k$ for all $k \ge n$, i.e. $b < a_n$ eventually. Conversely, if $b < a_n$ eventually, then
$b \le \liminf_n a_n$.






□ 

Exercise 1.4.10 We explain here the relationship between limits of sets and limits of numbers.

For $A \subseteq \Omega$, define the indicator function $I_A : \Omega \to \{0, 1\}$ of $A$ by

$$I_A : \Omega \to \mathbb{R} : \omega \mapsto \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{else} \end{cases}$$

Elsewhere in the mathematical literature, indicator functions are often called characteristic
functions.

Suppose that $(A_n : n \in \mathbb{N})$ is a countable sequence of subsets of $\Omega$. Show that

(a) $\limsup_n I_{A_n} = I_{\limsup_n A_n}$ and $\liminf_n I_{A_n} = I_{\liminf_n A_n}$.

(b) $A_n \to A$ if and only if $I_{A_n} \to I_A$.

□ 

1.4.3 Limits of Sets and Measures 



Proposition 1.4.11 (Continuity properties)

Suppose that $(\Omega, \mathcal{F}, \mu)$ is a measure space, and let $A_1, A_2, \dots \in \mathcal{F}$.

(a) If $A_n \uparrow A$, then $\mu A_n \uparrow \mu A$.

(b) If $A_n \downarrow A$, and if $\mu A_{n_0} < \infty$ for some $n_0$, then $\mu A_n \downarrow \mu A$.



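Proof: (a) Define $B_n$ inductively for $n \in \mathbb{N}$ by

$$B_1 := A_1 \qquad B_{n+1} := A_{n+1} - A_n$$

Then

$$\mu A = \mu\Big(\bigcup_n A_n\Big) = \mu\Big(\bigcup_n B_n\Big) = \sum_n \mu B_n = \lim_n \sum_{k=1}^{n} \mu B_k = \lim_n \mu\Big(\bigcup_{k=1}^{n} B_k\Big) = \lim_n \mu(A_n)$$

(b) Note that $\bigcap_{n=1}^{\infty} A_n = \bigcap_{n=n_0}^{\infty} A_n$, so we may assume without loss of generality that $n_0 = 1$.
Then $A_1 - A_n \uparrow A_1 - A$ as $n \to \infty$, so that $\mu(A_1 - A_n) \uparrow \mu(A_1 - A)$. Now recall that

$$\mu(A_1 - A) = \mu(A_1) - \mu(A),$$

and likewise $\mu(A_1 - A_n) = \mu(A_1) - \mu(A_n)$, so that $\mu A_n \downarrow \mu A$.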
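As a quick illustration of both parts, here is a worked example. It uses only the fact, announced in Example 1.3.2, that Lebesgue measure assigns each interval its length; the particular sets are chosen for illustration and are not taken from the notes.

```latex
% (a) an increasing sequence of sets
A_n = \left(0,\, 1 - \tfrac{1}{n}\right] \uparrow (0,1), \qquad
\lambda A_n = 1 - \tfrac{1}{n} \uparrow 1 = \lambda(0,1).

% (b) a decreasing sequence of sets of finite measure
B_n = \left(0,\, \tfrac{1}{n}\right] \downarrow \emptyset, \qquad
\lambda B_n = \tfrac{1}{n} \downarrow 0 = \lambda\emptyset .
```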



Exercise 1.4.12 (a) Show that Propn. 1.4.11(b) may fail if we drop the assumption that at
least one of the $\mu A_n$ is finite.

(b) Suppose that $\mu$ is finitely additive on the measurable space $(\Omega, \mathcal{F})$. Show that if

$$\mu A_n \to 0 \text{ whenever } A_n \downarrow \emptyset,$$

then $\mu$ is countably additive.






□ 

We end this section with some important results: 



Proposition 1.4.13 (a) FATOU'S LEMMA: If $\mu$ is a measure on $(\Omega, \mathcal{F})$, and if
$A_1, A_2, \dots \in \mathcal{F}$, then

$$\mu(\liminf_n A_n) \le \liminf_n \mu A_n$$

(b) REVERSE FATOU LEMMA: If $\mu$ is a finite measure on $(\Omega, \mathcal{F})$, and if $A_1, A_2, \dots \in \mathcal{F}$,
then

$$\mu(\limsup_n A_n) \ge \limsup_n \mu A_n$$



Proof: (a) Let $B_n = \bigcap_{m \ge n} A_m$. Then $\liminf_n A_n = \bigcup_n \bigcap_{m \ge n} A_m = \bigcup_n B_n$, so $B_n \uparrow
\liminf_n A_n$. By Propn. 1.4.11(a), we see that $\mu B_n \uparrow \mu(\liminf_n A_n)$. Also, clearly $B_n \subseteq A_n$,
and so $\mu A_n \ge \mu B_n$. It follows that

$$\liminf_n \mu A_n \ge \liminf_n \mu B_n = \mu(\liminf_n A_n)$$

(b) is left as an exercise.

⊣
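Both inequalities can be strict. The following small worked example shows this; the two-point space and the alternating sets are chosen here purely for illustration.

```latex
% Two-point space with the uniform probability measure
\Omega = \{0, 1\}, \qquad \mathbb{P}\{0\} = \mathbb{P}\{1\} = \tfrac{1}{2}, \qquad
A_n = \begin{cases} \{0\} & n \text{ odd} \\ \{1\} & n \text{ even} \end{cases}

% No point lies in all but finitely many A_n; every point lies in infinitely many
\liminf_n A_n = \emptyset, \qquad \limsup_n A_n = \Omega

% Hence both inequalities of Proposition 1.4.13 are strict here
\mathbb{P}(\liminf_n A_n) = 0 \;<\; \liminf_n \mathbb{P}A_n = \tfrac{1}{2}
  = \limsup_n \mathbb{P}A_n \;<\; 1 = \mathbb{P}(\limsup_n A_n)
```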

Exercise 1.4.14 We prove the reverse Fatou lemma:

(a) Let $B_n = \bigcup_{m \ge n} A_m$. Explain why $B_n \downarrow \limsup_n A_n$. Conclude that $\mu B_n \downarrow \mu(\limsup_n A_n)$.

(b) Explain why $\mu B_n \ge \sup_{m \ge n} \mu A_m$.

(c) Conclude that $\mu(\limsup_n A_n) \ge \lim_n \sup_{m \ge n} \mu A_m = \limsup_n \mu A_n$.

(d) Where did we need the fact that $\mu$ is a finite measure?

□ 

1.5 Extension of Measures 
1.5.1 Other Families of Sets 

$\sigma$-algebras can be quite complicated to deal with, and in many cases it is easier to work with
simpler classes of sets. We therefore "break up" the notion of $\sigma$-algebra into two parts, that
of $\pi$-system and $\lambda$-system. The definitions and results below are quite complicated, but what
you need to know is the following:

• A system $\mathcal{A}$ of sets is a $\sigma$-algebra iff it is both a $\pi$-system and a $\lambda$-system.

• The notion of $\lambda$-system meshes very nicely with the continuity properties of measure.
Therefore to prove that something is true for all events of a $\sigma$-algebra it often suffices
to show that it is true for the events in a $\pi$-system that generates the $\sigma$-algebra. The
events in such a $\pi$-system can often be very simple (e.g. they may be just intervals,
or "rectangles"), and are easier to handle than the very complicated events that may
occur in $\sigma$-algebras.



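Definition 1.5.1 Let $\mathcal{C}$ be a collection of subsets of $\Omega$.

(a) $\mathcal{C}$ is called a $\pi$-system if it is closed under finite intersections.

(b) $\mathcal{C}$ is called a $\lambda$-system if

(i) $\Omega \in \mathcal{C}$;

(ii) $A, B \in \mathcal{C}$ and $A \subseteq B$ implies $B - A \in \mathcal{C}$;

(iii) If $A_1, A_2, \dots \in \mathcal{C}$ and $A_n \uparrow A$, then $A \in \mathcal{C}$.

(c) We denote by $\pi(\mathcal{C})$ and $\lambda(\mathcal{C})$ the $\pi$-, respectively, $\lambda$-system generated by $\mathcal{C}$, i.e. the
smallest $\pi$-, respectively, $\lambda$-system on $\Omega$ which contains $\mathcal{C}$.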



Note that the intersection of an arbitrary family of $\pi$-systems is a $\pi$-system, so $\pi(\mathcal{C})$ is just
the intersection of all $\pi$-systems that contain $\mathcal{C}$. In particular, $\pi(\mathcal{C})$ always exists. Similar
remarks apply to $\lambda(\mathcal{C})$.



Proposition 1.5.2 A family $\mathcal{C}$ of subsets of $\Omega$ is a $\sigma$-algebra iff it is both a $\pi$-system
and a $\lambda$-system.



Proof: It is clear that a $\sigma$-algebra is also a $\pi$-system and a $\lambda$-system.

Conversely, suppose $\mathcal{C}$ is both a $\pi$- and a $\lambda$-system. Then $\mathcal{C}$ is closed under complementation,
by (i), (ii) of Defn. 1.5.1(b). De Morgan's Laws applied to Defn. 1.5.1(a) show that
$\mathcal{C}$ is closed under finite unions. Finally, given $A_1, A_2, \dots \in \mathcal{C}$, let $A = \bigcup_n A_n$. Define

$$B_n = \bigcup_{k \le n} A_k$$

Then each $B_n \in \mathcal{C}$, and $B_n \uparrow A$. Hence $A \in \mathcal{C}$, by Defn. 1.5.1(b)(iii), and thus $\mathcal{C}$ is closed
under countable unions.



Exercise 1.5.3 Suppose that $\mathcal{D}$ is a $\lambda$-system on a set $\Omega$, and that $A \in \mathcal{D}$. Prove that then
$\mathcal{D}_A := \{D \in \mathcal{D} : D \cap A \in \mathcal{D}\}$ is also a $\lambda$-system.

□ 

The following technical result often allows us to work with "easy" $\pi$-systems, instead of
the "difficult" $\sigma$-algebras:



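Theorem 1.5.4 (Dynkin's Lemma, Monotone Class Theorem)

(a) If $\mathcal{C}$ is a $\pi$-system on $\Omega$, then

$$\lambda(\mathcal{C}) = \sigma(\mathcal{C})$$

(b) Suppose that $\mathcal{C}$ is a $\pi$-system and that $\mathcal{D}$ is a $\lambda$-system (both on a set $\Omega$), and also
that $\mathcal{C} \subseteq \mathcal{D}$. Then $\sigma(\mathcal{C}) \subseteq \mathcal{D}$.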






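Proof: (a): Let $\mathcal{D} = \lambda(\mathcal{C})$. By Propn. 1.5.2, it suffices to show that $\mathcal{D}$ is a $\pi$-system. We do
this in two steps.

STEP I: Fix $C \in \mathcal{C}$, and define

$$\mathcal{D}_C = \{A \in \mathcal{D} : A \cap C \in \mathcal{D}\}$$

Then $\mathcal{C} \subseteq \mathcal{D}_C \subseteq \mathcal{D}$ (because $\mathcal{C}$ is a $\pi$-system). We now show that $\mathcal{D}_C = \mathcal{D}$. To that end, it
suffices to show that $\mathcal{D}_C$ is a $\lambda$-system (because then $\mathcal{D}_C$ is a $\lambda$-system containing $\mathcal{C}$, and $\mathcal{D}$
is the smallest such). We therefore verify (i)-(iii) of Defn. 1.5.1(b).

(i) is obvious.

If $A, B \in \mathcal{D}_C$ and $A \subseteq B$, then $(B - A) \cap C = (B \cap C) - (A \cap C)$. But $B \cap C, A \cap C \in \mathcal{D}$ by
definition of $\mathcal{D}_C$, and thus $(B - A) \cap C \in \mathcal{D}$, because $\mathcal{D}$ is a $\lambda$-system. Thus $(B - A) \in \mathcal{D}_C$,
which shows (ii).

Finally, if $A_1, A_2, \dots \in \mathcal{D}_C$ and $A_n \uparrow A$, then $A_1 \cap C, A_2 \cap C, \dots \in \mathcal{D}$ and $(A_n \cap C) \uparrow A \cap C$.
Hence $A \cap C \in \mathcal{D}$, and so $A \in \mathcal{D}_C$, proving (iii).

We now know that $\mathcal{D}_C = \mathcal{D}$ for every $C \in \mathcal{C}$.

STEP II: Now, fix any $D \in \mathcal{D}$, and define

$$\mathcal{D}_D = \{A \in \mathcal{D} : A \cap D \in \mathcal{D}\}$$

First note that if $C \in \mathcal{C}$, then $\mathcal{D}_C = \mathcal{D}$, so $D \in \mathcal{D}_C$. It follows that $D \cap C \in \mathcal{D}$, and thus that
$C \in \mathcal{D}_D$, for every $C \in \mathcal{C}$. Thus $\mathcal{C} \subseteq \mathcal{D}_D$, for all $D \in \mathcal{D}$.

It follows as above that $\mathcal{D}_D$ is a $\lambda$-system, and thus that $\mathcal{D}_D = \mathcal{D}$, for all $D \in \mathcal{D}$.

In particular, if $A, B \in \mathcal{D}$, then $A \in \mathcal{D}_B$, and so $A \cap B \in \mathcal{D}$. This shows that $\mathcal{D}$ is a
$\pi$-system, and thus a $\sigma$-algebra.

(b): Under the stated conditions, we have $\sigma(\mathcal{C}) = \lambda(\mathcal{C}) \subseteq \lambda(\mathcal{D}) = \mathcal{D}$.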
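On a finite set the generated systems can be computed by brute force, which gives a concrete feel for part (a) of the theorem and for why the $\pi$-system hypothesis matters. The sketch below is only an illustration on a four-point set; the function names and the two generating families are assumptions made for this example.

```python
def is_pi_system(family):
    """Check closure under pairwise (hence finite) intersections."""
    fam = {frozenset(s) for s in family}
    return all((a & b) in fam for a in fam for b in fam)

def lambda_system(omega, family):
    """Smallest lambda-system on the finite set omega that contains family.

    On a finite set, condition (iii) (increasing unions) is automatic, so it
    suffices to add omega and close under proper differences B - A, A ⊆ B.
    """
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family} | {omega}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            for b in list(sets):
                if a <= b and (b - a) not in sets:
                    sets.add(b - a)
                    changed = True
    return sets

def sigma_algebra(omega, family):
    """Smallest sigma-algebra on the finite set omega that contains family."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family} | {omega}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            for new in [omega - a] + [a | b for b in list(sets)]:
                if new not in sets:
                    sets.add(new)
                    changed = True
    return sets

omega = {0, 1, 2, 3}
not_pi = [{0, 1}, {1, 2}]      # {0,1} ∩ {1,2} = {1} is missing
pi = [{0, 1}, {1, 2}, {1}]     # closed under intersections
```

For the family `pi` the generated $\lambda$-system and $\sigma$-algebra coincide, as the theorem predicts. For `not_pi` the $\lambda$-system has only 6 members, while the $\sigma$-algebra is the full power set with 16.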



An important corollary of the preceding theorem is the following: Two probability measures
which agree on a $\pi$-system agree also on the $\sigma$-algebra generated by that $\pi$-system.



Proposition 1.5.5 Suppose that $\mu_1, \mu_2$ are finite measures on a measurable space $(\Omega, \mathcal{F})$,
and let $\mathcal{C}$ be a $\pi$-system such that $\Omega \in \mathcal{C}$ and $\sigma(\mathcal{C}) = \mathcal{F}$. Then $\mu_1 = \mu_2$ iff $\mu_1, \mu_2$ agree on
$\mathcal{C}$.



Proof: (Outline) Use the continuity of measure to show that the set $\mathcal{D} := \{F \in \mathcal{F} : \mu_1 F = \mu_2 F\}$
is a $\lambda$-system which contains $\mathcal{C}$. Then deduce that $\mathcal{D} \supseteq \sigma(\mathcal{C}) = \mathcal{F}$.
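The $\pi$-system hypothesis cannot simply be dropped. The sketch below gives two probability measures on a four-point set that agree on a generating family (and on the whole space) yet differ on the generated $\sigma$-algebra, because the family is not closed under intersections. The weights and names are made up for illustration.

```python
from fractions import Fraction

omega = (0, 1, 2, 3)
# Two probability measures on the power set of omega, given by point weights
mu1 = {w: Fraction(1, 4) for w in omega}                       # uniform
mu2 = {0: Fraction(0), 1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1, 2)}

def measure(weights, event):
    """Measure of a finite event as the sum of its point weights."""
    return sum((weights[w] for w in event), Fraction(0))

# These two sets generate the whole power set, but they are not a pi-system:
# their intersection {1} is missing from the family.
generators = [{0, 1}, {1, 2}]

agree_on_generators = all(measure(mu1, g) == measure(mu2, g) for g in generators)
agree_on_omega = measure(mu1, omega) == measure(mu2, omega)
differ_on_intersection = measure(mu1, {1}) != measure(mu2, {1})
```

Adding the missing intersection $\{1\}$ to the family turns it into a $\pi$-system, and then the two measures no longer agree on it, in line with the proposition.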



1.5.2 The Extension Theorem 



This section is mainly technical: The Caratheodory Extension Theorem (Thm. 1.5.9) ensures 
that commonly used measures, such as Lebesgue measure, actually exist. 






Definition 1.5.6 Let $\Omega$ be a non-empty set.

(a) A map $\mu : \mathcal{P}(\Omega) \to [0, +\infty]$ is said to be an outer measure on $\Omega$ iff:

(i) $\mu\emptyset = 0$;

(ii) $\mu$ is monotone: $A \subseteq B$ implies $\mu A \le \mu B$;

(iii) $\mu$ is countably sub-additive: If $A_1, A_2, \dots \subseteq \Omega$, then $\mu(\bigcup_n A_n) \le \sum_n \mu A_n$.

(b) If $\mu$ is an outer measure on $\Omega$, then a set $A \subseteq \Omega$ is said to be $\mu$-measurable if and only
if

$$\mu E = \mu(E \cap A) + \mu(E \cap A^c) \quad \text{for all } E \subseteq \Omega$$

Note that we require $\mu A$ to be defined for every subset $A \subseteq \Omega$. $\mu$ may not be a measure,
i.e. may not have the countable additivity property for every sequence $A_n$ of subsets of
$\Omega$. Nevertheless, it may be possible to find a $\sigma$-algebra $\mathcal{M}$ of sets on which $\mu$ is countably
additive. What might such a $\sigma$-algebra look like? Here is a candidate:

Theorem 1.5.7 Let $\mu$ be an outer measure on $\Omega$ and let $\mathcal{M}(\mu)$ be the family of all $\mu$-measurable
sets. Then $(\Omega, \mathcal{M}(\mu), \mu|_{\mathcal{M}(\mu)})$ is a measure space, i.e. $\mathcal{M}(\mu)$ is a $\sigma$-algebra,
and $\mu$ is countably additive on $\mathcal{M}(\mu)$.

Proof: Certainly $\emptyset \in \mathcal{M}(\mu)$, as $\mu(E) = \mu(E \cap \emptyset) + \mu(E \cap \emptyset^c)$. Also, it is obvious that
$A \in \mathcal{M}(\mu)$ implies $A^c \in \mathcal{M}(\mu)$.

Next, we show that $\mathcal{M}(\mu)$ is closed under finite intersections: Let $A, B \in \mathcal{M}(\mu)$, and let
$E \subseteq \Omega$. Then

$$\begin{aligned} \mu E &= \mu(E \cap A) + \mu(E \cap A^c) \\ &= \mu(E \cap A \cap B) + \mu(E \cap A \cap B^c) + \mu(E \cap A^c) \\ &= \mu(E \cap (A \cap B)) + \mu(E \cap (A \cap B)^c) \\ &\ge \mu E \end{aligned}$$

where the third line follows because $\mu(E \cap (A \cap B)^c) = \mu(E \cap (A \cap B)^c \cap A) + \mu(E \cap (A \cap B)^c \cap A^c) =
\mu(E \cap B^c \cap A) + \mu(E \cap A^c)$, and the final line holds because $\mu$ is sub-additive. It follows that
$A \cap B \in \mathcal{M}(\mu)$. Hence $\mathcal{M}(\mu)$ is an algebra.

Next, let $A, B \in \mathcal{M}(\mu)$ be disjoint, and let $E \subseteq \Omega$. Then $B \subseteq A^c$, and hence

$$\mu(E \cap (A \cup B)) = \mu(E \cap (A \cup B) \cap A) + \mu(E \cap (A \cup B) \cap A^c) = \mu(E \cap A) + \mu(E \cap B) \qquad (*)$$

Specializing to $E = \Omega$ proves that $\mu$ is finitely additive on $\mathcal{M}(\mu)$.

Now let $A_1, A_2, \dots$ be a countable sequence of disjoint elements of $\mathcal{M}(\mu)$, and let $E \subseteq \Omega$.
Define $B_n = \bigcup_{m \le n} A_m$, and let $A = \bigcup_n A_n = \bigcup_n B_n$. Then by monotonicity and (*), we see
that

$$\mu(E \cap A) \ge \mu(E \cap B_n) = \sum_{m \le n} \mu(E \cap A_m)$$

Letting $n \to \infty$, and invoking the fact that $\mu$ is subadditive, we see that

$$\mu(E \cap A) \ge \sum_n \mu(E \cap A_n) \ge \mu(E \cap A) \qquad (**)$$

so that equality holds throughout. Countable additivity of $\mu$ on $\mathcal{M}(\mu)$ is then obtained by
specializing to the case $E = \Omega$.

Also, since $B_n \in \mathcal{M}(\mu)$, we see, by monotonicity and (**), that

$$\begin{aligned} \mu E &= \mu(E \cap B_n) + \mu(E \cap B_n^c) \\ &\ge \sum_{m \le n} \mu(E \cap A_m) + \mu(E \cap A^c) \\ &\to \mu(E \cap A) + \mu(E \cap A^c) \\ &\ge \mu E \end{aligned}$$

Hence equality holds throughout, and so $A \in \mathcal{M}(\mu)$. This proves that $\mathcal{M}(\mu)$ is a $\sigma$-algebra.

⊣

Here is one of the most important ways of obtaining outer measures:

Proposition 1.5.8 Let $\mathcal{A}$ be an algebra on a set $\Omega$, and let $\mu_0$ be a non-negative countably
additive set function on $\mathcal{A}$ (i.e. whenever $A_n \in \mathcal{A}$ are disjoint and also $\bigcup_n A_n \in \mathcal{A}$, then
$\mu_0(\bigcup_n A_n) = \sum_n \mu_0(A_n)$).

Define $\mu^* : \mathcal{P}(\Omega) \to [0, +\infty]$ by

$$\mu^* E = \inf\Big\{ \sum_n \mu_0 A_n : (A_n)_n \text{ a sequence of sets in } \mathcal{A} \text{ with } E \subseteq \bigcup_n A_n \Big\}$$

Then $\mu^*$ is an outer measure on $\Omega$ which extends $\mu_0$ (i.e. $\mu^* A = \mu_0 A$ for all $A \in \mathcal{A}$).
Moreover, $\mathcal{A} \subseteq \mathcal{M}(\mu^*)$.

Terminology: If $(A_n)_n$ is a sequence in $\mathcal{A}$ such that $E \subseteq \bigcup_n A_n$, then we call $(A_n)_n$ an
$\mathcal{A}$-covering of $E$.

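Proof: To see that $\mu^*$ extends $\mu_0$ is easy: For let $A \in \mathcal{A}$. Firstly, if $(A_n)_n$ is a sequence in $\mathcal{A}$
which covers $A$, then $\mu_0 A \le \sum_n \mu_0 A_n$ (by countable additivity of $\mu_0$), and thus $\mu_0 A \le \mu^* A$.
The sequence $(A, \emptyset, \emptyset, \dots)$ witnesses the fact that $\mu^* A \le \mu_0 A$ as well.

Next, we prove that $\mu^*$ is an outer measure on $\Omega$.

(i) $\mu^* \emptyset = \mu_0 \emptyset = 0$;

(ii) If $E \subseteq F$, then any $\mathcal{A}$-covering of $F$ is also a covering of $E$, and thus $\mu^* E \le \mu^* F$.

(iii) Finally, suppose that $E_n \subseteq \Omega$, and let $E = \bigcup_n E_n$. We must show that

$$\mu^* E \le \sum_n \mu^* E_n \qquad (*)$$

Fix $\varepsilon > 0$. For each $n$, choose a sequence $(A_{nk})_k$ in $\mathcal{A}$ such that

$$E_n \subseteq \bigcup_k A_{nk} \quad \text{and} \quad \sum_k \mu_0 A_{nk} \le \mu^* E_n + \varepsilon 2^{-n}$$

Then $\{A_{nk} : n, k \in \mathbb{N}\}$ is a countable $\mathcal{A}$-covering of $E$, and thus

$$\mu^* E \le \sum_{n,k} \mu_0 A_{nk} = \sum_n \sum_k \mu_0 A_{nk} \le \sum_n (\mu^* E_n + \varepsilon 2^{-n}) = \sum_n \mu^* E_n + \varepsilon$$

Since $\varepsilon$ is arbitrary ($> 0$), (*) follows.

Next, we show that every member of $\mathcal{A}$ is $\mu^*$-measurable. So let $B \in \mathcal{A}$, and let $E \subseteq \Omega$
be arbitrary. Let $\varepsilon > 0$. Choose a sequence $(A_n)_n$ in $\mathcal{A}$ which covers $E$, and such that

$$\sum_n \mu_0 A_n \le \mu^* E + \varepsilon$$

Then $(A_n \cap B)_n$, $(A_n \cap B^c)_n$ are, respectively, sequences in $\mathcal{A}$ which cover $E \cap B$ and $E \cap B^c$.
It follows that

$$\mu^*(E \cap B) + \mu^*(E \cap B^c) \le \sum_n \big(\mu_0(A_n \cap B) + \mu_0(A_n \cap B^c)\big) = \sum_n \mu_0 A_n \le \mu^* E + \varepsilon$$

Since $\varepsilon > 0$ is arbitrary, it follows that

$$\mu^*(E \cap B) + \mu^*(E \cap B^c) = \mu^* E \quad \text{for all } E \subseteq \Omega$$

(we proved $\le$, but $\ge$ always holds, by sub-additivity.) Hence $B$ is $\mu^*$-measurable, for every
$B \in \mathcal{A}$.

⊣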
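When $\Omega$ and the algebra are finite, the infimum over coverings reduces to a minimum over finitely many subfamilies of the algebra, so $\mu^*$ and the family of $\mu^*$-measurable sets can be computed by brute force. The sketch below does this for a four-point set. The algebra, the weights of $\mu_0$, and the function names are made up for illustration.

```python
from itertools import combinations

def subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

omega = frozenset({0, 1, 2, 3})
# mu_0 on the algebra {∅, {0,1}, {2,3}, Ω}; the weights are illustrative
mu0 = {
    frozenset(): 0,
    frozenset({0, 1}): 1,
    frozenset({2, 3}): 2,
    omega: 3,
}

def outer(e):
    """mu*(E): the cheapest covering of E by sets of the algebra."""
    e = frozenset(e)
    costs = [sum(mu0[a] for a in cover)
             for cover in subsets(mu0)
             if e <= frozenset().union(*cover)]
    return min(costs)

def is_measurable(a):
    """Caratheodory's criterion, tested against every E ⊆ Ω."""
    a = frozenset(a)
    return all(outer(e) == outer(e & a) + outer(e - a) for e in subsets(omega))

measurable = {a for a in subsets(omega) if is_measurable(a)}
```

In this small example the measurable sets turn out to be exactly the four sets of the algebra. For instance $\{0\}$ fails the criterion with $E = \{0, 1\}$, since $\mu^*\{0\} + \mu^*\{1\} = 2 \ne 1$.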



Theorem 1.5.9 (Caratheodory's Extension Theorem)

Let $\mathcal{A}$ be an algebra on a set $\Omega$, and let $\mu_0$ be a countably additive non-negative set
function$^a$ on $\mathcal{A}$, such that $\mu_0 \emptyset = 0$. Then $\mu_0$ extends to a measure $\mu$ on $\sigma(\mathcal{A})$, i.e. there
is a measure $\mu$ on $(\Omega, \sigma(\mathcal{A}))$ such that $\mu_0 A = \mu A$ for all $A \in \mathcal{A}$.
Moreover, if $\mu_0$ is $\sigma$-finite, then the extension $\mu$ of $\mu_0$ is unique.

$^a$ This means that if $A_1, A_2, \dots \in \mathcal{A}$ are mutually disjoint, and if also $\bigcup_n A_n \in \mathcal{A}$, then $\mu_0(\bigcup_n A_n) = \sum_n \mu_0(A_n)$.



Proof: Let $\mathcal{F} = \sigma(\mathcal{A})$, and define $\mu^*$ on $(\Omega, \mathcal{P}(\Omega))$ by

$$\mu^* E = \inf\Big\{ \sum_n \mu_0 A_n : (A_n)_n \text{ a sequence of sets in } \mathcal{A} \text{ with } E \subseteq \bigcup_n A_n \Big\}$$

Then by Propn. 1.5.8, $\mu^*$ is an outer measure which extends $\mu_0$. By Theorem 1.5.7, $\mu^*$ is
a measure on the $\sigma$-algebra $\mathcal{M}(\mu^*)$ of all $\mu^*$-measurable sets, and $\mathcal{A} \subseteq \mathcal{M}(\mu^*)$. Hence also
$\sigma(\mathcal{A}) = \mathcal{F} \subseteq \mathcal{M}(\mu^*)$, and $\mu = \mu^*|_{\mathcal{F}}$ is a measure on $(\Omega, \mathcal{F})$.

We now turn to the uniqueness of the extension. First suppose that $\mu_0 \Omega < \infty$, and
suppose that $\nu$ is another extension of $\mu_0$ to $\mathcal{F}$. Since $\Omega \in \mathcal{A}$, we have $\nu\Omega = \mu\Omega$. We must
show that $\mu F = \nu F$ for all $F \in \mathcal{F}$. Let $(A_n)_n$ be an $\mathcal{A}$-cover for $F$. Then

$$\nu F \le \sum_n \nu A_n = \sum_n \mu_0 A_n$$

because $\nu|_{\mathcal{A}} = \mu_0$. Since $(A_n)_n$ was an arbitrary $\mathcal{A}$-cover of $F$, we must have $\nu F \le \mu^* F = \mu F$,
for all $F \in \mathcal{F}$. Thus also $\nu F^c \le \mu F^c$, and hence

$$\nu\Omega = \nu F + \nu F^c \le \mu F + \mu F^c = \mu\Omega = \nu\Omega$$

It is now easy to see that $\nu F = \mu F$, for all $F \in \mathcal{F}$.

Finally, assume that $\mu_0$ is $\sigma$-finite on $(\Omega, \mathcal{A})$. Then there exists a sequence $A_n \in \mathcal{A}$
such that $A_n \uparrow \Omega$, and such that $\mu A_n < \infty$ (for all $n \in \mathbb{N}$). As above, we can prove that
$\mu(F \cap A_n) = \nu(F \cap A_n)$, for all $n \in \mathbb{N}$, and using Propn. 1.4.11, we see that $\nu F = \mu F$ in this
case also.

⊣






1.6 Lebesgue Measure 
1.6.1 Lebesgue measure on $\mathbb{R}$

In this section, we construct the natural notion of length, namely Lebesgue measure on $\mathbb{R}$.

First, we define the Lebesgue outer measure $\lambda^* : \mathcal{P}(\mathbb{R}) \to [0, +\infty]$ (cf. Defn. 1.5.6 for
definition of outer measure). Given an interval $I \subseteq \mathbb{R}$, define $|I|$ to be the length of the
interval. (We used the notation $\mathcal{L}(I)$ for this in §1.2, but now reserve the symbol $\mathcal{L}$ for
another purpose.) Then, for $A \subseteq \mathbb{R}$, define

$$\lambda^*(A) = \inf\Big\{ \sum_{n=1}^{\infty} |I_n| : \text{each } I_n \text{ an interval}, \; A \subseteq \bigcup_n I_n \Big\}$$

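Remarks 1.6.1 Note that we also have

$$\lambda^*(A) = \inf\Big\{ \sum_{n=1}^{\infty} |I_n| : \text{each } I_n \text{ a finite open interval}, \; A \subseteq \bigcup_n I_n \Big\}$$

For let $\bar{\lambda} A = \inf\{ \sum_{n=1}^{\infty} |I_n| : \text{each } I_n \text{ a finite open interval}, A \subseteq \bigcup_n I_n \}$. Then clearly $\lambda^* A \le
\bar{\lambda} A$. To prove the reverse inequality, fix $A \subseteq \mathbb{R}$ and $\varepsilon > 0$, and choose intervals $I_n$ covering $A$ such that $\sum_n |I_n| \le
\lambda^* A + \frac{\varepsilon}{2}$. If $|I_n| = +\infty$ for some $n \in \mathbb{N}$, then clearly $\lambda^* A = +\infty$, in which case $\bar{\lambda} A = \lambda^* A$
(i.e. there is nothing to prove). Hence we may assume that each $I_n$ is a finite interval.
Now let $J_n$ be an open interval such that $I_n \subseteq J_n$ and $|J_n| \le |I_n| + \varepsilon 2^{-n-1}$. Then
$\bar{\lambda} A \le \sum_n |J_n| \le \sum_n |I_n| + \frac{\varepsilon}{2} \le \lambda^* A + \varepsilon$. Letting $\varepsilon \downarrow 0$, we see that $\bar{\lambda} A \le \lambda^* A$, as required.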

□ 



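Proposition 1.6.2 $\lambda^*$ is an outer measure on $\mathbb{R}$, and $\lambda^* I = |I|$ for every interval $I$.



Proof: It is clear that $\lambda^*$ is a monotone increasing non-negative function with $\lambda^*(\emptyset) = 0$. To
prove that $\lambda^*$ is also countably sub-additive, let $A_1, A_2, \dots \subseteq \mathbb{R}$, and fix an arbitrary $\varepsilon > 0$.
By definition of $\lambda^*$ we may, for each $n \in \mathbb{N}$, choose open intervals $I_{n,1}, I_{n,2}, \dots$ such that

$$A_n \subseteq \bigcup_k I_{n,k} \qquad \sum_k |I_{n,k}| \le \lambda^* A_n + \varepsilon 2^{-n} \qquad \text{all } n \in \mathbb{N}$$

Then

$$\bigcup_n A_n \subseteq \bigcup_n \bigcup_k I_{n,k} \qquad \lambda^*\Big(\bigcup_n A_n\Big) \le \sum_n \sum_k |I_{n,k}| \le \sum_n \lambda^* A_n + \varepsilon$$

Since $\varepsilon$ was arbitrary ($> 0$), it follows that $\lambda^*(\bigcup_n A_n) \le \sum_n \lambda^* A_n$.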

It remains to show that $\lambda^* I = |I|$ for every interval $I \subseteq \mathbb{R}$. It is obvious that we always
have $\lambda^* I \le |I|$. To prove the reverse inequality, first assume that $I$ is a compact interval (i.e.
$I = [a, b]$ for some $-\infty < a \le b < +\infty$). Choose finite open intervals $(a_1, b_1), (a_2, b_2), \dots$
such that $[a, b] \subseteq \bigcup_n (a_n, b_n)$, cf. Remarks 1.6.1. By compactness, there is $n$ such that

$$[a, b] \subseteq \bigcup_{k=1}^{n} (a_k, b_k)$$

We now check by induction on $n$ that $b - a \le \sum_{k=1}^{n} (b_k - a_k)$. This is obvious if $n = 1$.
Now suppose that the assertion is true for $n - 1$, and let $(a_1, b_1), \dots, (a_n, b_n)$ be a cover for
$[a, b]$. Without loss of generality (relabelling if necessary), we may assume that $b \in (a_n, b_n)$.
If $a_n < a$, then already $b - a < b_n - a_n$ and there is nothing to prove. Otherwise
$(a_1, b_1), \dots, (a_{n-1}, b_{n-1})$ is a cover of the interval $[a, a_n]$. By induction hypothesis, we
have $a_n - a \le \sum_{k=1}^{n-1} (b_k - a_k)$, and hence

$$b - a = (b - a_n) + (a_n - a) \le (b_n - a_n) + \sum_{k=1}^{n-1} (b_k - a_k) = \sum_{k=1}^{n} (b_k - a_k)$$

as required.

It follows that $b - a \le \sum_{k=1}^{\infty} (b_k - a_k)$. Since the $(a_n, b_n)$ were an arbitrary open covering
of $[a, b]$, it follows that $b - a \le \lambda^*[a, b]$, i.e. that $|I| \le \lambda^* I$ for every compact interval $I$.

It remains to deal with intervals that are non-compact. If $I$ is a bounded interval, then
there is a compact interval $J$ such that $J \subseteq I$ and $|I| \le |J| + \varepsilon$ (where we fix $\varepsilon > 0$). It
follows that $|I| \le \lambda^* J + \varepsilon$. Now since also $\lambda^* J \le \lambda^* I$, we must have $|I| \le \lambda^* J + \varepsilon \le \lambda^* I + \varepsilon$.
Letting $\varepsilon \downarrow 0$, we see that $|I| \le \lambda^* I$ if $I$ is a bounded interval.

Finally, if $I$ is an unbounded interval, then $\lambda^* I = +\infty$: Indeed, if $K > 0$ is arbitrary,
there is a compact interval $J \subseteq I$ such that $|J| \ge K$. Then

$$\lambda^* I \ge \lambda^* J \ge K$$

Letting $K \uparrow \infty$, we see that $\lambda^* I = +\infty = |I|$ when $I$ is unbounded.



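Now that we have constructed an outer measure $\lambda^*$, it follows by Thm. 1.5.7 that there
is a $\sigma$-algebra $\mathcal{L}(\mathbb{R})$ on $\mathbb{R}$ such that $\lambda = \lambda^*|_{\mathcal{L}(\mathbb{R})}$ is a measure on $(\mathbb{R}, \mathcal{L}(\mathbb{R}))$. Indeed, we
have $\mathcal{L}(\mathbb{R}) := \mathcal{M}(\lambda^*)$, the family of all $\lambda^*$-measurable sets. The $\sigma$-algebra $\mathcal{L}(\mathbb{R})$ is called the
$\sigma$-algebra of all Lebesgue measurable sets, or the Lebesgue algebra, on $\mathbb{R}$.

Our next aim is to show that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{L}(\mathbb{R})$, i.e. that every Borel set is Lebesgue measurable.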

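Proposition 1.6.3 Every Borel set is Lebesgue measurable.



Proof: It suffices to prove that every interval of the form $(-\infty, a]$ is Lebesgue measurable,
because the collection of intervals of this form generates $\mathcal{B}(\mathbb{R})$, cf. Propn. 1.2.8. Let $E \subseteq \mathbb{R}$
be arbitrary, and let $I = (-\infty, a]$. Fix $\varepsilon > 0$, and choose intervals $I_1, I_2, \dots$ covering $E$ such that
$\sum_n |I_n| \le \lambda^* E + \varepsilon$. Note that if $J$ is an arbitrary interval, then so are $J \cap I$, $J \cap I^c$, and

$$|J| = |J \cap I| + |J \cap I^c|.$$

Now

$$\begin{aligned} \lambda^* E + \varepsilon &\ge \sum_n |I_n| \\ &= \sum_n |I_n \cap I| + \sum_n |I_n \cap I^c| \\ &\ge \lambda^*(E \cap I) + \lambda^*(E \cap I^c) \end{aligned}$$

Letting $\varepsilon \downarrow 0$, we see that $\lambda^* E \ge \lambda^*(E \cap I) + \lambda^*(E \cap I^c)$, for all $E \subseteq \mathbb{R}$. Since the reverse
inequality always holds by sub-additivity, $I$ is $\lambda^*$-measurable.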



We have almost proved the following theorem: 






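Theorem 1.6.4 There exists a unique measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\lambda I = |I|$ for every
interval $I$.



Proof: By Propn. 1.6.2, $\lambda^*$ is an outer measure with $\lambda^* I = |I|$ for every interval, and thus
it follows by Thm. 1.5.7 that there is a $\sigma$-algebra $\mathcal{L}(\mathbb{R})$ on $\mathbb{R}$ such that $\lambda = \lambda^*|_{\mathcal{L}(\mathbb{R})}$ is a
measure on $(\mathbb{R}, \mathcal{L}(\mathbb{R}))$. By Propn. 1.6.3, $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{L}(\mathbb{R})$.

It remains to prove uniqueness: Suppose that $\mu$ is another measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with the
property that $\mu I = |I|$ for all intervals $I$. Let $I_n = [-n, n]$. Then $\lambda_n := \lambda|_{I_n}$ and $\mu_n := \mu|_{I_n}$ are
finite measures on $(I_n, \mathcal{B}(I_n))$. Now $\mathcal{C}_n = \{J : J \text{ an interval}, J \subseteq I_n\}$ is a $\pi$-system which
generates $\mathcal{B}(I_n)$, and $\lambda_n, \mu_n$ agree on $\mathcal{C}_n$. Hence, by Propn. 1.5.5, $\lambda_n, \mu_n$ agree on $\mathcal{B}(I_n)$.

Now, if $B \in \mathcal{B}(\mathbb{R})$, then by Propn. 1.4.11

$$\lambda B = \lim_n \lambda_n(B \cap I_n) = \lim_n \mu_n(B \cap I_n) = \mu B$$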



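Remarks 1.6.5 It is easy to see that Lebesgue measure can be extended to the extended real
number system $\overline{\mathbb{R}} = [-\infty, +\infty]$. The Borel algebra on $\overline{\mathbb{R}}$ is generated by all sets of the form

$$[-\infty, a) \qquad (a, b) \qquad (b, \infty] \qquad a, b \in \mathbb{R}$$

(which form a base for the topology on $\overline{\mathbb{R}}$). Thus $\mathcal{B}(\overline{\mathbb{R}})$ is the family of all sets of the form

$$B \qquad B \cup \{+\infty\} \qquad B \cup \{-\infty\} \qquad B \cup \{+\infty, -\infty\}$$

where $B$ is an ordinary Borel set.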

□ 



1.6.2 Lebesgue Measure on $\mathbb{R}^d$

It is possible to modify the construction of Lebesgue measure on $\mathbb{R}$ to obtain a unique measure
on $\mathcal{B}(\mathbb{R}^d)$, also denoted by $\lambda$, which assigns to every rectangle its volume: A rectangle in
$\mathbb{R}^d$ is a set

$$R = I_1 \times I_2 \times \dots \times I_d \quad \text{where } I_1, I_2, \dots, I_d \text{ are intervals in } \mathbb{R}$$

The volume of such a rectangle is defined by $\text{vol}(R) = |I_1| \times |I_2| \times \dots \times |I_d|$.
We can then define a map $\lambda^* : \mathcal{P}(\mathbb{R}^d) \to [0, +\infty]$ by

$$\lambda^*(A) = \inf\Big\{ \sum_{n=1}^{\infty} \text{vol}(R_n) : \text{each } R_n \text{ a rectangle}, \; A \subseteq \bigcup_n R_n \Big\}$$

and show that $\lambda^*$ is an outer measure, and that every Borel set is $\lambda^*$-measurable.

Instead of performing the construction outlined above, we elect to wait until we have
constructed products of measure spaces: Given measure spaces $(\Omega_i, \mathcal{F}_i, \mu_i)$, where $i = 1, \dots, n$,
it is possible to construct, in a canonical way, a measure space

$$\Big( \prod_{i \le n} \Omega_i, \; \bigotimes_{i \le n} \mathcal{F}_i, \; \prod_{i \le n} \mu_i \Big)$$






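It will turn out that

$$\mathcal{B}(\mathbb{R}^d) = \bigotimes_{i \le d} \mathcal{B}(\mathbb{R}) \qquad \lambda^d = \prod_{i \le d} \lambda$$

where $\lambda^d$ denotes the $d$-dimensional Lebesgue measure.

Thus it is possible to construct $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), \lambda^d)$ directly from $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$, and a repetition
of the construction of Lebesgue measure proves unnecessary.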

1.6.3 Lebesgue Measure and Coin Tossing 

The following exercise explores the relationship between Lebesgue measure (the probability
measure for the random experiment "Pick a number between 0 and 1") and the tossing of a
fair coin infinitely many times.

Exercise 1.6.6 Consider a random experiment in which a fair coin is tossed infinitely many
times. We want to find an appropriate probability space for this experiment.

Let 1 stand for "Heads" and 0 for "Tails". We may want to take the sample space to be

$$\Omega = \{0, 1\}^{\mathbb{N}}$$

i.e. $\Omega$ is the set of $\mathbb{N}$-indexed sequences of 0's and 1's. We take a slightly different view,
however. Every sequence of 0's and 1's can be regarded as the binary expansion of a real
number (cf. remarks on base $n$-expansions at the end of this assignment sheet). For
example, the sequence

$$1\,1\,0\,1\,0\,0\,1 \dots \text{ corresponds to the real number } (0.1101001\dots)_2$$

The only problem is that this correspondence is not one-to-one:

$$(1, 0, 0, 0, \dots) \to \tfrac{1}{2} \qquad (0, 1, 1, 1, \dots) \to \tfrac{1}{2}$$

Let $T$ denote the set of all sequences that eventually end in all 0's. A little thought shows that
the correspondence above is a bijection between $\Omega - T$ and $(0, 1]$.

(a) Show that $T$ is a countable set. Explain why $\mathbb{P}(T) = 0$.

It is therefore practically certain that the event $T$ won't occur. The terminating binary
expansions are therefore, in a sense, redundant. Nothing is lost by chucking them out (except
a set of probability 0). We may therefore take the sample space to be the set

$$\Omega = (0, 1]$$

(a) Show that

(i) The event that the first toss is H is the interval
$A = ((0.1000\dots)_2, (0.1111\dots)_2] = (\frac{1}{2}, 1]$.

(ii) The event that the third toss is T is the set
$B = (0, \frac{1}{8}] \cup (\frac{1}{4}, \frac{3}{8}] \cup (\frac{1}{2}, \frac{5}{8}] \cup (\frac{3}{4}, \frac{7}{8}]$.

(iii) The event that the first H turns up on an even toss is
$C = \bigcup_{k=1}^{\infty} (\frac{1}{2^{2k}}, \frac{1}{2^{2k-1}}]$.






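Here are some natural $\sigma$-algebras for this random experiment:

$$\mathcal{F}_n := \text{all events that can be decided after } n \text{ tosses}$$
$$\mathcal{F} := \text{all events that can be decided after only finitely many tosses}$$

(a) (i) Explain why $\mathcal{F}_1 = \sigma((0, \frac{1}{2}], (\frac{1}{2}, 1])$. Show that $A \in \mathcal{F}_1$, but that $B, C \notin \mathcal{F}_1$.

(ii) Explain why $\mathcal{F}_2 = \sigma((0, \frac{1}{4}], (\frac{1}{4}, \frac{1}{2}], (\frac{1}{2}, \frac{3}{4}], (\frac{3}{4}, 1])$. Show that $A \in \mathcal{F}_2$, but $B, C \notin \mathcal{F}_2$.

(iii) Explain why $\mathcal{F}_n = \sigma((\frac{k}{2^n}, \frac{k+1}{2^n}] : k = 0, 1, \dots, 2^n - 1)$, and why $\mathcal{F} = \sigma((\frac{k}{2^n}, \frac{k+1}{2^n}] : n \in
\mathbb{N}, k = 0, 1, \dots, 2^n - 1)$.

(iv) Show that $B \in \mathcal{F}_3$, but $C \notin \mathcal{F}_3$.

(v) Show that $C \notin \mathcal{F}_n$ for any $n \in \mathbb{N}$, but that $C \in \mathcal{F}$.

(b) Explain why $\mathcal{F}$ is precisely the Borel algebra on $(0, 1]$ (i.e. the smallest $\sigma$-algebra on
$(0, 1]$ which contains all the intervals).

We have now seen that the "right" sample space for this experiment is $(0, 1]$, and that the
"right" $\sigma$-algebra is $\mathcal{F} = \mathcal{B}(0, 1]$. Let's now try to identify the "right" probability measure $\mathbb{P}$
for this experiment.

(a) (i) Explain why $\mathbb{P}(A) = \frac{1}{2} = \lambda(A)$

(ii) Explain why $\mathbb{P}(B) = \frac{1}{2} = \lambda(B)$

(iii) Explain why $\mathbb{P}(C) = \frac{1}{3} = \lambda(C)$.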

It therefore becomes apparent that the "right" probability space for the random experiment
of tossing a coin infinitely many times is just the same as that for the random experiment of
picking a number from $(0, 1]$:

Borel's Principle:

Consider the random experiment of tossing a (fair) coin infinitely many times, and let $E$
be an event. Interpret $E$ as a subset of $(0, 1]$. Then $\mathbb{P}(E) = \lambda(E)$, i.e. the probability
that the event $E$ occurs is just the Lebesgue measure of the associated subset of the unit
interval.



□ 
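Borel's Principle can be checked by hand for events decided by finitely many tosses: each pattern of the first $n$ tosses corresponds to one dyadic interval of length $2^{-n}$. The sketch below uses exact fractions to list these intervals for three events from the exercise; the function names are illustrative, and the last event is truncated to the first six tosses so that it is decided in finite time.

```python
from fractions import Fraction
from itertools import product

def dyadic_interval(prefix):
    """(lo, hi]: the x in (0, 1] whose non-terminating binary expansion starts with prefix."""
    n = len(prefix)
    k = sum(bit << (n - 1 - i) for i, bit in enumerate(prefix))
    return Fraction(k, 2 ** n), Fraction(k + 1, 2 ** n)

def event_intervals(n, decided_by):
    """Dyadic intervals making up an event that is decided by the first n tosses."""
    return [dyadic_interval(p) for p in product((0, 1), repeat=n) if decided_by(p)]

def length(intervals):
    """Total length of a list of disjoint half-open intervals."""
    return sum((hi - lo for lo, hi in intervals), Fraction(0))

A = event_intervals(1, lambda p: p[0] == 1)   # first toss is H
B = event_intervals(3, lambda p: p[2] == 0)   # third toss is T
# First H on toss 2, 4 or 6 (0-based index 1, 3 or 5): a truncation of the
# event "the first H turns up on an even toss"
C6 = event_intervals(6, lambda p: 1 in p and p.index(1) % 2 == 1)
```

The truncated event has length $\frac{1}{4} + \frac{1}{16} + \frac{1}{64} = \frac{21}{64}$, a partial sum of the geometric series whose limit is $\frac{1}{3}$.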



Chapter 2 

Random Variables 



2.1 Definition of Measurable Function 

Let $f : A \to S$ be a map between two sets. Recall that $f$ induces a set map $f^{-1} : \mathcal{P}(S) \to \mathcal{P}(A)$
between the power sets (in the opposite direction) by

$$f^{-1}[T] = \{a \in A : f(a) \in T\}$$

Remarks 2.1.1 Here is some motivation for the definition of measurable function.

Suppose that $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, i.e. a function
$X : \Omega \to \mathbb{R}$ which assigns a number $X(\omega)$ to every outcome $\omega \in \Omega$ (we will make this
notion more precise shortly). We would like to be able to discuss the probability that $X = 0$,
or that $X$ lies between $-1$ and $1$, etc. Thus we'd like to know

$$\mathbb{P}(X = 0) := \mathbb{P}(\{\omega \in \Omega : X(\omega) = 0\}) = \mathbb{P}(X^{-1}\{0\})$$
$$\mathbb{P}(-1 \le X \le 1) := \mathbb{P}(\{\omega \in \Omega : -1 \le X(\omega) \le 1\}) = \mathbb{P}(X^{-1}[-1, 1])$$

However, $\mathbb{P}(F)$ makes sense only if $F \in \mathcal{F}$. Thus, in order to be able to discuss the above
probabilities, it is necessary that the sets

$$X^{-1}\{0\} := \{\omega \in \Omega : X(\omega) = 0\} \qquad X^{-1}[-1, 1] = \{\omega \in \Omega : X(\omega) \in [-1, 1]\}$$

belong to $\mathcal{F}$.

More generally, given a Borel set $B$, we want to be able to discuss the probability that
the outcome $X(\omega)$ belongs to $B$. For

$$\mathbb{P}(X \in B) := \mathbb{P}(\{\omega \in \Omega : X(\omega) \in B\})$$

to make sense, it is necessary that the set

$$X^{-1}B = \{\omega \in \Omega : X(\omega) \in B\}$$

belongs to $\mathcal{F}$.

Thus: We can only meaningfully discuss the possible values of the random variable $X$ in
a probabilistic setting if $X^{-1}B \in \mathcal{F}$ for every $B \in \mathcal{B}(\mathbb{R})$, i.e. it is necessary that

$$X^{-1}\mathcal{B}(\mathbb{R}) \subseteq \mathcal{F}$$






□ 

We begin with a little set theory: 
Proposition 2.1.2 If $T, T_n \subseteq S$, then

(a) $f^{-1}[T^c] = (f^{-1}[T])^c$

(b) $f^{-1}[\bigcup_n T_n] = \bigcup_n f^{-1}[T_n]$

(c) $f^{-1}[\bigcap_n T_n] = \bigcap_n f^{-1}[T_n]$
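For finite sets, the identities of Propn. 2.1.2 can be checked by brute force. The following is a minimal Python sketch; the helper name `preimage` and the particular sets and map are illustrative assumptions, not part of the notes.

```python
def preimage(f, domain, T):
    """Return f^{-1}[T] = {a in domain : f(a) in T} as a frozenset."""
    return frozenset(a for a in domain if f(a) in T)

# Hypothetical finite example: A = {0,...,5}, S = {0,1,2}, f(a) = a mod 3.
A = frozenset(range(6))
S = frozenset(range(3))
f = lambda a: a % 3

T1, T2 = frozenset({0, 1}), frozenset({1, 2})

# (a) the preimage of a complement is the complement of the preimage
complement_ok = preimage(f, A, S - T1) == A - preimage(f, A, T1)
# (b) the preimage of a union is the union of the preimages
union_ok = preimage(f, A, T1 | T2) == preimage(f, A, T1) | preimage(f, A, T2)
# (c) the preimage of an intersection is the intersection of the preimages
intersection_ok = preimage(f, A, T1 & T2) == preimage(f, A, T1) & preimage(f, A, T2)
```

A finite check of this kind is of course no proof, but it shows why the pullback behaves better than the forward image: forward images do not in general commute with intersections or complements.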



If $f : A \to S$, and $\mathcal{S}$ is a family of subsets of $S$, we denote by $f^{-1}\mathcal{S}$ the family of subsets
of $A$ defined by

$$f^{-1}\mathcal{S} = \{f^{-1}[T] : T \in \mathcal{S}\}$$

Proposition 2.1.3 Suppose that $(A, \mathcal{A})$, $(S, \mathcal{S})$ are measurable spaces, and that $f : A \to S$
is a map. Then

(i) $\mathcal{A}' = f^{-1}\mathcal{S}$ is a $\sigma$-algebra on $A$.

(ii) $\mathcal{S}' = \{T \subseteq S : f^{-1}[T] \in \mathcal{A}\}$ is a $\sigma$-algebra on $S$.

Exercise 2.1.4 Prove Propns. 2.1.2 and 2.1.3.



□ 



Definition 2.1.5 (1.) Let $(A, \mathcal{A})$, $(S, \mathcal{S})$ be measurable spaces. A map $f : A \to S$ is said to
be $\mathcal{A}/\mathcal{S}$-measurable if and only if $f^{-1}\mathcal{S} \subseteq \mathcal{A}$ (i.e. $f^{-1}[T] \in \mathcal{A}$ for all $T \in \mathcal{S}$).

If the $\sigma$-algebras $\mathcal{A}, \mathcal{S}$ are obvious from context, we simply call $f$ a measurable function.

(2.) A measurable function $X$ from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is called a
random variable.

A measurable function $X$ from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ to $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is called a
random vector.

More generally, any measurable function from a probability space to a measurable
space is called a random element.

(3.) If $S$ is a topological space, then a measurable function $(S, \mathcal{B}(S)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is called
a Borel function.

We are usually interested in the case where $S = \mathbb{R}$ or $\mathbb{R}^d$.
Thus we have the following pullback condition for measurability:



A function is measurable iff pullbacks of measurable sets
are measurable.



Note the similarity with the definition of continuous function: A function $f$ between
topological spaces $X, Y$ is continuous iff $f^{-1}[V]$ is an open subset of $X$ whenever $V$ is an
open subset of $Y$, i.e. iff pullbacks of open sets are open.
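On finite measurable spaces the pullback condition can be tested directly, set by set. The sketch below is only an illustration: the names `is_measurable`, `sigma_A`, `sigma_S` and the two maps `f`, `g` are assumptions made for this example.

```python
def preimage(f, domain, T):
    """Return f^{-1}[T] = {a in domain : f(a) in T} as a frozenset."""
    return frozenset(a for a in domain if f(a) in T)

def is_measurable(f, domain, sigma_A, sigma_S):
    """Pullback condition: f^{-1}[T] must lie in sigma_A for every T in sigma_S."""
    return all(preimage(f, domain, T) in sigma_A for T in sigma_S)

# Domain {1,2,3,4} with the sigma-algebra generated by the partition {1,2}, {3,4}.
Omega = frozenset({1, 2, 3, 4})
sigma_A = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), Omega}

# Codomain {0,1} with its power set as sigma-algebra.
S = frozenset({0, 1})
sigma_S = {frozenset(), frozenset({0}), frozenset({1}), S}

f = lambda w: 0 if w <= 2 else 1   # constant on each block of the partition
g = lambda w: w % 2                # separates points inside a block

f_measurable = is_measurable(f, Omega, sigma_A, sigma_S)  # pullbacks are {}, {1,2}, {3,4}, Omega
g_measurable = is_measurable(g, Omega, sigma_A, sigma_S)  # pullback of {0} is {2,4}, not in sigma_A
```

Note that no measure appears anywhere in the check: only the two $\sigma$-algebras matter.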






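Remarks 2.1.6 (1) The notion of measure does not occur in the definition of measurable
function/random variable: Only the measurable spaces (= set + $\sigma$-algebra) play a role.

(2) If $f : (A, \mathcal{A}) \to (S, \mathcal{S})$ is measurable, then

$$f : A \to S \qquad\text{and}\qquad f^{-1} : \mathcal{S} \to \mathcal{A}$$

(3) If $X$ is a random variable on $(\Omega, \mathcal{F})$ and $B \subseteq \mathbb{R}$, we write

$$\{X \in B\} \quad\text{for}\quad X^{-1}B = \{\omega \in \Omega : X(\omega) \in B\}$$

(4) We will also allow extended real-valued maps $A \to \bar{\mathbb{R}}$, where $\bar{\mathbb{R}} = [-\infty, +\infty]$.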

□ 

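Example 2.1.7 Let $(S, \mathcal{S})$ be a measurable space. For each $A \subseteq S$ define the indicator
function $I_A : S \to \mathbb{R}$ by

$$I_A(s) = \begin{cases} 1 & \text{if } s \in A \\ 0 & \text{otherwise} \end{cases}$$

If $A$ is a measurable set, i.e. $A \in \mathcal{S}$, and if $B \in \mathcal{B}(\mathbb{R})$ is a Borel set, then

$$I_A^{-1}[B] = \begin{cases} \emptyset & \text{if neither } 0, 1 \in B \\ A & \text{if } 1 \in B \text{ and } 0 \notin B \\ A^c & \text{if } 0 \in B \text{ and } 1 \notin B \\ S & \text{if both } 0, 1 \in B \end{cases}$$

It follows that $I_A^{-1}[B] \in \mathcal{S}$ for every $B \in \mathcal{B}(\mathbb{R})$, so that $I_A$ is a measurable function.

Similarly, if $I_A$ is a measurable function, then $I_A^{-1}\{1\} = A \in \mathcal{S}$, since $\{1\}$ is a Borel set.

Thus:

$A$ is a measurable set if and only if $I_A$ is a measurable function.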

□ 

The above example is important enough to restate as a Proposition: 



Proposition 2.1.8 Suppose that $(\Omega, \mathcal{F})$ is a measurable space, and that $A \subseteq \Omega$. Then
the indicator $I_A : \Omega \to \mathbb{R}$ is a measurable function iff $A$ is a measurable set (i.e. $I_A$ is
$\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable iff $A \in \mathcal{F}$).



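Exercise 2.1.9 Suppose that $\Omega$ is a set, that $\mathcal{F} := \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra on $\Omega$, and
that $\mathcal{G} := \mathcal{P}(\Omega)$ is the powerset-algebra on $\Omega$.

(a) Determine all $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable functions $\Omega \to \mathbb{R}$.

(b) Determine all $\mathcal{G}/\mathcal{B}(\mathbb{R})$-measurable functions $\Omega \to \mathbb{R}$.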






□ 

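Exercise 2.1.10 Suppose that $(\Omega, \mathcal{F})$, $(S, \mathcal{S})$ are measurable spaces, and that $f : \Omega \to S$ is
$\mathcal{F}/\mathcal{S}$-measurable.

(1.) Show that if $\mathcal{G}$ is a $\sigma$-algebra on $\Omega$ such that $\mathcal{F} \subseteq \mathcal{G}$, then $f$ is also $\mathcal{G}/\mathcal{S}$-measurable.
(2.) Show that if $\mathcal{T}$ is a $\sigma$-algebra on $S$ such that $\mathcal{T} \subseteq \mathcal{S}$, then $f$ is also $\mathcal{F}/\mathcal{T}$-measurable.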

□ 

To check if a function is measurable, it suffices to check the pullback condition on a 
generating set: 



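Proposition 2.1.11 Suppose that $(\Omega, \mathcal{F})$, $(S, \mathcal{S})$ are measurable spaces, and that $f : \Omega \to S$
is a map. Suppose further that $\mathcal{C}$ is a family of subsets of $S$ such that $\sigma(\mathcal{C}) = \mathcal{S}$. Then $f$
is $\mathcal{F}/\mathcal{S}$-measurable iff $f^{-1}\mathcal{C} \subseteq \mathcal{F}$.



Proof: ($\Rightarrow$) is obvious: Clearly $f^{-1}\mathcal{C} \subseteq f^{-1}\mathcal{S}$, and if $f$ is measurable, then $f^{-1}\mathcal{S} \subseteq \mathcal{F}$ by
definition of measurability.

($\Leftarrow$): Let $\mathcal{T} = \{T \in \mathcal{S} : f^{-1}[T] \in \mathcal{F}\}$. Then $\mathcal{C} \subseteq \mathcal{T}$, by assumption, and $\mathcal{T}$ is a $\sigma$-algebra, by
Propn. 2.1.2 (Check this!). Hence $\mathcal{S} = \sigma(\mathcal{C}) \subseteq \mathcal{T} \subseteq \mathcal{S}$, i.e. $\mathcal{T} = \mathcal{S}$.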



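Here are some special cases of Propn. 2.1.11.



Corollary 2.1.12 A function $f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable iff one of the following
conditions holds:

(a) $\{f \le c\} \in \mathcal{F}$ for all $c \in \mathbb{R}$. (Recall that $\{f \le c\} := \{\omega \in \Omega : f(\omega) \le c\}$.)

(b) $\{f < c\} \in \mathcal{F}$ for all $c \in \mathbb{R}$.

(c) $\{f \ge c\} \in \mathcal{F}$ for all $c \in \mathbb{R}$.

(d) $\{f > c\} \in \mathcal{F}$ for all $c \in \mathbb{R}$.



Proof: (a) In Propn. 2.1.11 take $(S, \mathcal{S}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and $\mathcal{C}$ to be the collection of all intervals
of the form $(-\infty, c]$. We already know that these intervals generate the Borel algebra on $\mathbb{R}$;
cf. Propn. 1.2.8.

(b), (c), (d) are proved similarly.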



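Corollary 2.1.13 If $X, S$ are topological spaces and $f : X \to S$ is continuous, then $f$ is
$\mathcal{B}(X)/\mathcal{B}(S)$-measurable.



Proof: In Propn. 2.1.11 take $\mathcal{C}$ to be the collection of all open subsets of $S$. By continuity,
$f^{-1}\mathcal{C}$ consists of open subsets of $X$, so that $f^{-1}\mathcal{C} \subseteq \mathcal{B}(X)$.



Corollary 2.1.14 Any monotone function from $\mathbb{R}$ to $\mathbb{R}$ is a Borel function, i.e.
$\mathcal{B}(\mathbb{R})/\mathcal{B}(\mathbb{R})$-measurable.



Proof: If $f : \mathbb{R} \to \mathbb{R}$ is monotone, then $\{f \le c\}$ is an interval (for all $c \in \mathbb{R}$), and thus in $\mathcal{B}(\mathbb{R})$.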






2.2 Combinations of Measurable Functions 

Measurable functions can be combined in a variety of ways to form new measurable functions: 



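Proposition 2.2.1 (a) Suppose that $f, g : (\Omega, \mathcal{F}) \to \mathbb{R}$ are measurable functions and that
$a \in \mathbb{R}$. Then

$$f + g \qquad af \qquad fg \qquad f/g$$

are measurable functions, where we assume $g \ne 0$ on $\Omega$ for the case $f/g$.

(b) If $f_n : (\Omega, \mathcal{F}) \to \bar{\mathbb{R}}$ are measurable functions for $n \in \mathbb{N}$, then

$$\sup_n f_n \qquad \inf_n f_n \qquad \limsup_n f_n \qquad \liminf_n f_n$$

are measurable.

(c) If $f_n : (\Omega, \mathcal{F}) \to \mathbb{R}$ are measurable functions for $n \in \mathbb{N}$, and if $f_n \to f$ pointwise on $\Omega$,
then $f$ is measurable.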



Proof: (a) Suppose that $f, g$ are measurable. First, we show that $f + g$ is measurable. By
Propn. 2.1.11 it suffices to show that

$$\{f + g > c\} \in \mathcal{F} \quad\text{for all } c \in \mathbb{R}$$

Now $f(s) + g(s) > c$ iff $f(s) > c - g(s)$ iff $f(s) > q > c - g(s)$ for some $q \in \mathbb{Q}$. Thus

$$\{f + g > c\} = \bigcup_{q \in \mathbb{Q}} \big(\{f > q\} \cap \{g > c - q\}\big)$$

Now $\{f > q\}, \{g > c - q\} \in \mathcal{F}$ because $f, g$ are measurable. Since $\mathbb{Q}$ is a countable set,
$\{f + g > c\} \in \mathcal{F}$ also.

Next, we show that $f^2$ is measurable. This follows easily from Propn. 2.1.11 using the
fact that

$$\{f^2 < c\} = \begin{cases} \{-\sqrt{c} < f < \sqrt{c}\} & \text{if } c > 0 \\ \emptyset & \text{else} \end{cases}$$

To see that $af$ is measurable is easy, e.g. if $a > 0$, then $\{af < c\} = \{f < \frac{c}{a}\}$.
Next, to see that $fg$ is measurable, use the polarization identity

$$fg = \tfrac{1}{4}\big[(f + g)^2 - (f - g)^2\big]$$

Finally, to see that $\frac{f}{g}$ is measurable, it suffices to see that $\frac{1}{g}$ is measurable. But if $c > 0$, then

$$\{\tfrac{1}{g} < c\} = \big(\{\tfrac{1}{c} < g\} \cap \{g > 0\}\big) \cup \big(\{\tfrac{1}{c} > g\} \cap \{g < 0\}\big)$$

Similar arguments work if $c < 0$ or $c = 0$.

(b) Note that

$$\{\sup_n f_n > c\} = \bigcup_n \{f_n > c\}$$

that $\inf_n f_n = -\sup_n(-f_n)$, that $\limsup_n f_n = \inf_n \sup_{k \ge n} f_k$ and that $\liminf_n f_n = -\limsup_n(-f_n)$.

(c) Note that if $f_n \to f$, then $f = \limsup_n f_n = \liminf_n f_n$.






Corollary 2.2.2 If $f, g : (\Omega, \mathcal{F}) \to \mathbb{R}$ are measurable functions, then so are

• $f \vee g := \max\{f, g\}$ and $f \wedge g := \min\{f, g\}$.

• $f^+ := f \vee 0 = \max\{f, 0\}$ and $f^- := -(f \wedge 0) = \max\{-f, 0\}$.

• $|f|$



Proof: $f \vee g = \sup\{f, g\}$, $f \wedge g = \inf\{f, g\}$. Furthermore $|f| = f^+ + f^-$, as is easily verified.

∎

Remarks 2.2.3 With respect to the above definitions, it is useful to note that:

$$f = f^+ - f^- \qquad |f| = f^+ + f^-$$

□ 

Exercise 2.2.4 Suppose that $(f_n)_n$ is a sequence of measurable functions from a measurable
space $(S, \mathcal{S})$ to $\mathbb{R}$. Prove that the set $\{s \in S : \lim_n f_n(s) \text{ exists}\}$ is measurable.

□ 

Measurability, like continuity, is preserved under composition: 



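Proposition 2.2.5 If $f : (A, \mathcal{A}) \to (S, \mathcal{S})$ and $g : (S, \mathcal{S}) \to (T, \mathcal{T})$ are measurable functions,
then $g \circ f : (A, \mathcal{A}) \to (T, \mathcal{T})$ is measurable.



Proof: If $C \in \mathcal{T}$, then $(g \circ f)^{-1}[C] = f^{-1}[g^{-1}[C]]$. Now $g^{-1}[C] \in \mathcal{S}$ because $g$ is measurable.
Thus $f^{-1}[g^{-1}[C]] \in \mathcal{A}$, because $f$ is measurable.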



2.3 Approximation by Simple Functions 

We have already seen that an indicator function $I_A$ is a measurable function iff $A$ is a measurable
set. It follows from Propn. 2.2.1 that linear combinations of such indicators are
measurable as well.



Definition 2.3.1 A measurable function $f : (\Omega, \mathcal{F}) \to \mathbb{R}$ is called a simple function if $\operatorname{ran} f$
is a finite set.



Let $(\Omega, \mathcal{F})$ be a measurable space. Recall that a finite or countably infinite sequence
$(F_n)_n$ of members of $\mathcal{F}$ is said to form a partition of $\Omega$ iff (i) the $F_n$ are mutually disjoint
(i.e. $F_n \cap F_m = \emptyset$ when $n \ne m$), and (ii) $\bigcup_n F_n = \Omega$.






Proposition 2.3.2 A measurable function $f : (\Omega, \mathcal{F}) \to \mathbb{R}$ is simple iff it is a linear
combination of measurable indicator functions:

$$f = \sum_{i=1}^{n} c_i I_{F_i}$$

Moreover, the sets $F_i \in \mathcal{F}$ can be chosen to form a partition of $\Omega$.

Proof: It is obvious that a function of the form $f = \sum_{i=1}^{n} c_i I_{F_i}$ (where each $F_i \in \mathcal{F}$) is
simple: $f$ can only take on values which are sums of finitely many of the $c_i$.

Suppose now that $f$ is simple, i.e. that $\operatorname{ran} f = \{c_1, \dots, c_n\}$ is a finite set. Define $F_i =
f^{-1}\{c_i\}$ for $i = 1, \dots, n$. Then the $F_i$ form a partition of $\Omega$, and $f = \sum_{i=1}^{n} c_i I_{F_i}$.

∎

Simple functions play an important part in integration theory. Many important results 
are proved first for simple functions, and then extended to arbitrary measurable functions by 
taking limits. The next proposition is therefore extremely important: 



Proposition 2.3.3 (a) For any non-negative measurable function $f : (\Omega, \mathcal{F}) \to \bar{\mathbb{R}}^+$ there
exists a sequence of simple measurable functions $f_n$, $n \in \mathbb{N}$, such that $0 \le f_n \uparrow f$.
Moreover, if $f$ is bounded, we can choose the $f_n$ so that $f_n \to f$ uniformly.

(b) For any measurable function $f : (\Omega, \mathcal{F}) \to \mathbb{R}$, there is a sequence of simple measurable
functions such that $f_n \to f$. Moreover, if $f$ is bounded, we can choose the $f_n$ so that
$f_n \to f$ uniformly.

Proof: (a) Define

$$f_n(s) := 2^{-n} \lfloor 2^n f(s) \rfloor \wedge n$$

where $\lfloor x \rfloor$ is the greatest integer $\le x$. This elegant definition deciphers as follows:

$$f_n := \sum_{k=0}^{n2^n - 1} k2^{-n} I_{\{k2^{-n} \le f < (k+1)2^{-n}\}} + n I_{\{f \ge n\}}$$

which means that

If $f(s) < n$, then $f_n(s) = \frac{k}{2^n}$ exactly when $\frac{k}{2^n} \le f(s) < \frac{k+1}{2^n}$
If $f(s) \ge n$, then $f_n(s) = n$

Thus $f_n$ is simple and non-negative. Moreover, $0 \le f(s) - f_n(s) < 2^{-n}$ whenever $f(s) < n$.

Next, we show that $(f_n)_n$ is an increasing sequence. Let $s \in \Omega$, and suppose first that
$f(s) < n$. Then there is a unique
$m \in \mathbb{N}$ such that $m2^{-(n+1)} \le f(s) < (m+1)2^{-(n+1)}$, i.e. $f_{n+1}(s) = m2^{-(n+1)}$. If $m$ is
even, there is $k \in \mathbb{N}$ such that $m = 2k$, in which case $k2^{-n} \le f(s) < (k+1)2^{-n}$, i.e.
$f_n(s) = k2^{-n} = f_{n+1}(s)$. If $m$ is odd, there is $k \in \mathbb{N}$ such that $m = 2k+1$, in which case
$k2^{-n} < (2k+1)2^{-(n+1)} \le f(s) < (k+1)2^{-n}$, i.e. $f_n(s) = k2^{-n} < (2k+1)2^{-(n+1)} = f_{n+1}(s)$.
Thus, whether $m$ is even or odd, $f_n(s) \le f_{n+1}(s)$. If instead $f(s) \ge n$, then $f_n(s) = n$, and
$f_{n+1}(s) \ge n$ because $2^{-(n+1)} \lfloor 2^{n+1} f(s) \rfloor \ge n$.

Next, we show that $f_n(s) \to f(s)$ for all $s \in \Omega$. If $f(s) = +\infty$, then $f_n(s) = n$ for all
$n \in \mathbb{N}$, so certainly $f_n(s) \to f(s)$. If $f(s) < \infty$, choose $N$ such that $f(s) < N$. If $n \ge N$, then
$0 \le f(s) - f_n(s) < 2^{-n}$, and thus $|f(s) - f_n(s)| < 2^{-n}$. Thus $f_n(s) \to f(s)$ in this case also.






Finally, if $f$ is bounded, i.e. $f < N$ for some $N \in \mathbb{N}$, then we see that $|f(s) - f_n(s)| < 2^{-n}$
for all $n \ge N$ and all $s \in \Omega$, i.e. $f_n \to f$ uniformly.

(b) Now let $f$ be an arbitrary measurable function to $\mathbb{R}$. Then $f$ is the difference of two
non-negative measurable functions $f = f^+ - f^-$ (cf. Remarks 2.2.3), and thus, as in (a),
there exist non-negative simple functions $f_n^+, f_n^-$ such that $f_n^+ \uparrow f^+$, $f_n^- \uparrow f^-$. Clearly then
also $(f_n^+ - f_n^-) \to f$. Now note that if $f$ is bounded, so are $f^+, f^-$. If the $f_n^+$ and $f_n^-$ converge
uniformly, then also $(f_n^+ - f_n^-) \to f$ uniformly.

∎
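The dyadic construction in part (a) of the proof is easy to experiment with numerically. The sketch below is a hedged illustration only: the function name `dyadic_approx`, the test function $s \mapsto s^2$ and the sample grid are assumptions chosen for this example.

```python
import math

def dyadic_approx(f, n):
    """Return f_n, where f_n(s) = min(2^{-n} * floor(2^n * f(s)), n) for non-negative f."""
    def f_n(s):
        value = f(s)
        if math.isinf(value):
            return float(n)          # f(s) = +infinity is capped at n
        return min(math.floor(2**n * value) / 2**n, n)
    return f_n

# Hypothetical bounded test function on [0, 2]; here 0 <= f <= 4.
f = lambda s: s * s
points = [i / 100 for i in range(201)]

# f_1, ..., f_7 evaluated on the grid.
levels = [dyadic_approx(f, n) for n in range(1, 8)]

# Monotonicity: f_n <= f_{n+1} at every grid point.
increasing = all(
    levels[i](s) <= levels[i + 1](s)
    for i in range(len(levels) - 1)
    for s in points
)

# Since f < 7, the cap is inactive for n = 7 and the error is below 2^{-7}.
max_error_7 = max(f(s) - levels[6](s) for s in points)
```

The error bound $2^{-n}$ does not depend on $s$ once $n$ exceeds the bound of $f$, which is exactly the uniform convergence claimed for bounded $f$.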

Exercise 2.3.4 Suppose that $(\Omega, \mathcal{F})$ is a measurable space, and that $\mathcal{A} = \{A_n : n \in \mathbb{N}\}$ is a
partition of $\Omega$ which generates $\mathcal{F}$, i.e. $\sigma(\mathcal{A}) = \mathcal{F}$. (Recall that each element of $\mathcal{F}$ is then a
union of some of the $A_n$'s; cf. Exercise 1.2.5.) We show that the measurable functions are
precisely those which are constant on the blocks $A_n$.

(a) Show that if $f : \Omega \to \mathbb{R}$ is $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable, then $f$ is constant on each block $A_n$ (i.e.
if $\omega_1, \omega_2 \in A_n$ for some $n \in \mathbb{N}$, then $f(\omega_1) = f(\omega_2)$).

(b) Show, conversely, that if $f$ is constant on each block, then $f$ is $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable.

□ 

Here follows a very powerful result: 



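Theorem 2.3.5 (Monotone Class Theorem)

Let $\mathcal{H}$ be a set of bounded functions from a set $\Omega$ into $\mathbb{R}$ satisfying the following conditions:

(i) $\mathcal{H}$ is a vector space.

(ii) The constant function 1 belongs to $\mathcal{H}$.

(iii) Given any sequence $h_n$ of non-negative elements of $\mathcal{H}$ such that $h_n \uparrow h$, if $h$ is
bounded, then $h \in \mathcal{H}$.

Let $\mathcal{A}$ be a $\pi$-system on $\Omega$ with the property that $I_A \in \mathcal{H}$ for every $A \in \mathcal{A}$.
Then every bounded $\sigma(\mathcal{A})$-measurable function belongs to $\mathcal{H}$.



Proof: Let $\mathcal{D} = \{F \subseteq \Omega : I_F \in \mathcal{H}\}$. It is not hard to show that $\mathcal{D}$ is a $\lambda$-system. By Dynkin's
Lemma (Thm 1.5.4), $\mathcal{D} \supseteq \sigma(\mathcal{A})$.

Let $h$ be a non-negative, bounded $\sigma(\mathcal{A})$-measurable function, with upper bound $K$, i.e.

$$0 \le h(\omega) \le K \quad\text{for all } \omega \in \Omega$$

Let $h_n$, $n \in \mathbb{N}$, be a sequence of simple $\sigma(\mathcal{A})$-measurable functions such that $h_n \uparrow h$, each
written as a finite linear combination of indicators $I_{A(n,k)}$ with $A(n,k) \in \sigma(\mathcal{A})$ (cf. Propns.
2.3.2 and 2.3.3). Since $\sigma(\mathcal{A}) \subseteq \mathcal{D}$, each $A(n,k) \in \mathcal{D}$, i.e. $I_{A(n,k)} \in \mathcal{H}$. Because $\mathcal{H}$ is a vector space, we
now see that $h_n \in \mathcal{H}$ for each $n \in \mathbb{N}$. Thus $h \in \mathcal{H}$ as well, by (iii).

We have now shown that every non-negative bounded $\sigma(\mathcal{A})$-measurable function belongs
to $\mathcal{H}$. The same result can be obtained for arbitrary bounded $h$ by splitting into positive and
negative parts: $h = h^+ - h^-$.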






2.4 Pushing Measures along Functions: The Laws of a Random Variable

The next proposition shows that measures can be pushed forward along measurable functions. 



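Proposition 2.4.1 Suppose that $f : (\Omega, \mathcal{F}) \to (S, \mathcal{S})$ is measurable, and that $\mu$ is a (probability)
measure on $(\Omega, \mathcal{F})$. Define a set function $\mu f^{-1}$ on $\mathcal{S}$ by

$$(\mu f^{-1})(T) = \mu(f^{-1}[T])$$

Then $\mu f^{-1}$ is a (probability) measure on $(S, \mathcal{S})$.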

□ 



Exercise 2.4.2 Prove Propn. 2.4.1 



□ 

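Remarks 2.4.3 If $X : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathbb{R}$ is a random variable, then $\mathbb{P}X^{-1}$ is a probability measure
on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, called the distribution or law of the random variable $X$. Note that

$$(\mathbb{P}X^{-1})(B) = \mathbb{P}(X \in B)$$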

□ 
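On a finite probability space the law $\mathbb{P}X^{-1}$ is determined by the masses it gives to the values of $X$, so it can be computed by simply accumulating point masses. A minimal sketch, in which the helper names `pushforward` and `law_of_set` and the random variable $\omega \mapsto \omega \bmod 3$ are assumptions made for illustration:

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(P, X):
    """Law of X: map each value x to P(X = x), given point masses P[omega]."""
    law = defaultdict(Fraction)      # missing values start at Fraction(0)
    for omega, mass in P.items():
        law[X(omega)] += mass
    return dict(law)

# Uniform measure on the six faces of a die.
P = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Hypothetical random variable: the remainder of the outcome modulo 3.
X = lambda omega: omega % 3

law = pushforward(P, X)

def law_of_set(B):
    """(P X^{-1})(B) = P(X in B), computed from the point masses of the law."""
    return sum((p for x, p in law.items() if x in B), Fraction(0))
```

Here each of the values 0, 1, 2 is hit by exactly two outcomes, so the law is uniform on $\{0, 1, 2\}$ even though $X$ is far from injective.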



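Exercise 2.4.4 (a) Suppose that $(\Omega, \mathcal{F}, \mathbb{P})$ is the die space, i.e. $\Omega = \{1, 2, \dots, 6\}$, $\mathcal{F} = \mathcal{P}(\Omega)$
and $\mathbb{P}(\{\omega\}) = \frac{1}{6}$ for all $\omega \in \Omega$. Define $X : \Omega \to \mathbb{R} : \omega \mapsto \omega^2 - 5$. Show that $X$ is
$\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable, and determine the law of $X$, i.e. the measure $\mathbb{P}X^{-1}$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

(b) Suppose that $F : \mathbb{R} \to \mathbb{R} : x \mapsto x^2$. Show that $F$ is a Borel function, and calculate
$\lambda F^{-1}[-1, 3]$ (where $\lambda$ is Lebesgue measure).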

□ 

A measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is said to be locally finite iff $\mu(I) < \infty$ for every compact interval
$I$. The next theorem states that there is a one-to-one correspondence between locally finite
measures and increasing right-continuous functions $F$ normalized so that $F(0) = 0$.



Theorem 2.4.5 (a) Suppose that $F : \mathbb{R} \to \mathbb{R}$ is a right-continuous increasing function
with $F(0) = 0$. There is a unique locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with the
property that

$$\mu(a, b] = F(b) - F(a) \qquad -\infty < a < b < \infty$$

The measure $\mu$ is called the Lebesgue-Stieltjes measure associated with $F$.

(b) Conversely, given a locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, there is a unique right-continuous
increasing function $F$ with $F(0) = 0$ so that

$$F(b) - F(a) = \mu(a, b] \qquad -\infty < a < b < \infty$$



Exercise 2.4.6 We prove Thm. 2.4.5.

(a) Suppose that $F$ is right-continuous increasing with $F(0) = 0$. Define a function $g : \mathbb{R} \to \mathbb{R}$
by

$$g(t) = \inf\{s \in \mathbb{R} : F(s) \ge t\} \qquad t \in \mathbb{R}$$

(Recall that $\inf \emptyset := \infty$.)

(a.1) Show that $g(t) \le x$ iff $t \le F(x)$, so that $g$ is a generalized inverse of $F$.
(a.2) Show that $g$ is increasing and left-continuous.
(a.3) Explain why $g$ is a Borel function.

(a.4) Define a measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by $\mu := \lambda g^{-1}$, where $\lambda$ is Lebesgue measure. Use
(a.1) to show that $\mu(a, b] = F(b) - F(a)$ whenever $-\infty < a < b < \infty$.

(a.5) Now prove the uniqueness of $\mu$: Explain why if $\nu$ is any other measure on $\mathbb{R}$ that
satisfies $\nu(a, b] = F(b) - F(a)$ for all $-\infty < a < b < \infty$, then $\nu = \mu$.

(b) Suppose that $\mu$ is a locally finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Define $F : \mathbb{R} \to \mathbb{R}$ by

$$F(x) = \begin{cases} \mu(0, x] & \text{if } x \ge 0 \\ -\mu(x, 0] & \text{if } x < 0 \end{cases}$$

(b.1) Show that $F$ is right-continuous increasing with $F(0) = 0$.
(b.2) Show that $\mu(a, b] = F(b) - F(a)$ whenever $-\infty < a < b < \infty$.
(b.3) Show that $F$ is the unique function satisfying (b.1) and (b.2).



Chapter 3 

Information and Independence 



3.1 Conditional Probability and Independence of Events 

Probability is all about information. I toss a coin and see that it lands heads. You don't see
the coin. For you the probability that the coin has landed heads is $\frac{1}{2}$, but for me it is 1. New
information changes the probability measures.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $A, B$ be events. Knowledge that $B$ has
occurred can change our estimation of the probability that $A$ has occurred. We write $\mathbb{P}(A|B)$
for the probability that $A$ occurs given that we know that $B$ has occurred. We call $\mathbb{P}(A|B)$
the conditional probability of $A$ given $B$.

Example 3.1.1 A die is rolled. Let $A$ be the event that the outcome is a 6, let $B$ be the
event that the outcome is an even number, and let $C$ be the event that the outcome is an
odd number. Clearly $\mathbb{P}(A) = \frac{1}{6}$. However, if we know for sure that the outcome is an even
number, then the probability of getting a 6 is $\frac{1}{3}$, i.e. $\mathbb{P}(A|B) = \frac{1}{3}$. In the same way, if $B$
occurs, then $C$ cannot possibly occur, so although $\mathbb{P}(C) = \frac{1}{2}$, $\mathbb{P}(C|B) = 0$.

□ 

Basically, what's happening here is that we have to modify our probability measure to
accommodate the "new" information that $B$ has occurred (where we assume $\mathbb{P}(B) > 0$). If $\mathbb{P}(\cdot|B)$ is the new probability
measure on $(\Omega, \mathcal{F})$, then we must have $\mathbb{P}(B|B) = 1$ and $\mathbb{P}(\Omega - B|B) = 0$. If $A$ is another
event, then $A$ occurs if and only if $A \cap B$ occurs, since we know that $B$ also occurs, and
it makes sense to assume that the new probability that $A$ occurs is proportional to the old
probability that $A \cap B$ occurs, i.e. that $\mathbb{P}(A|B) = c\,\mathbb{P}(A \cap B)$ for some constant $c$. To ensure
$\mathbb{P}(B|B) = 1$, we must have $c = \mathbb{P}(B)^{-1}$. We therefore find that

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$$

the standard formula given in elementary probability theory texts.

Exercises 3.1.2 (1.) Prove that $\mathbb{P}(\cdot|B)$ is a probability measure on $(\Omega, \mathcal{F})$.
(2.) For events $A_1, \dots, A_n$, prove that

$$\mathbb{P}(A_1 \cap \dots \cap A_n) = \mathbb{P}(A_1)\,\mathbb{P}(A_2|A_1)\,\mathbb{P}(A_3|A_1 \cap A_2) \cdots \mathbb{P}(A_n|A_1 \cap \dots \cap A_{n-1})$$






□ 



Example 3.1.3 A couple has two children. Assuming that boys and girls are equally likely,
and given that one of the children is a girl, what is the probability that the other child is also
a girl?

We can model this probability space as follows:

$$\Omega = \{BB, BG, GB, GG\} \qquad \mathcal{F} = \mathcal{P}(\Omega) \qquad \mathbb{P}(\{\omega\}) = \tfrac{1}{4}$$

Let $B$ be the event that at least one of the children is a girl, i.e. $B = \{GB, BG, GG\}$, and
let $A$ be the event that both children are girls, i.e. $A = \{GG\}$. Then

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{1/4}{3/4} = \frac{1}{3}$$

□ 
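The computation in Example 3.1.3 can be reproduced by enumerating the sample space. The following is a small sketch with exact fractions; the helper names `prob` and `conditional` are assumptions introduced for this illustration.

```python
from fractions import Fraction

def prob(event, P):
    """P(event) as the sum of the point masses of its outcomes."""
    return sum((P[w] for w in event), Fraction(0))

def conditional(A, B, P):
    """P(A | B) = P(A and B) / P(B); assumes P(B) > 0."""
    return prob(A & B, P) / prob(B, P)

# Sample space of Example 3.1.3 with the uniform measure.
Omega = {"BB", "BG", "GB", "GG"}
P = {w: Fraction(1, 4) for w in Omega}

B = {w for w in Omega if "G" in w}   # at least one girl: {BG, GB, GG}
A = {"GG"}                           # both children are girls

answer = conditional(A, B, P)
```

The value obtained is $\frac{1}{3}$ rather than $\frac{1}{2}$ because conditioning on "at least one girl" leaves three equally likely outcomes, only one of which is $GG$.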



Two events $A, B$ are said to be independent if knowledge of $B$ tells us nothing about $A$,
and vice versa. By this we mean that our estimation of the probability that $A$ occurs isn't
changed by the knowledge that $B$ has occurred. Thus:

$$\mathbb{P}(A|B) = \mathbb{P}(A)$$

and hence we have the multiplication law

$$\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$$

The above equation gives us the definition of independent events:

Definition 3.1.4 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A (possibly infinite) set $\mathcal{A} = \{A_i :
i \in I\}$ of events is said to be an independent family provided that for any distinct $i_1, \dots, i_n \in I$

$$\mathbb{P}(A_{i_1} \cap A_{i_2} \cap \dots \cap A_{i_n}) = \mathbb{P}(A_{i_1})\,\mathbb{P}(A_{i_2}) \cdots \mathbb{P}(A_{i_n})$$



Example 3.1.5 (a) Consider the random trial of tossing a coin twice. The sample space
$\Omega$ is the 4-element set $\{HH, HT, TH, TT\}$ and the associated $\sigma$-algebra is just $\mathcal{P}(\Omega)$.
Intuitively, if the coin is fair, the outcome of the first coin should have no influence on the
second. Thus knowing that the first coin has landed heads should make no difference to
whether the second coin lands heads. Let $B = \{HH, HT\}$ be the event that the first coin
lands heads, and let $A = \{HH, TH\}$ be the event that the second coin lands heads. Then
$\mathbb{P}(A \cap B) = \mathbb{P}(\{HH\}) = \frac{1}{4}$, and $\mathbb{P}(A) \cdot \mathbb{P}(B) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. Thus $\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$,
i.e. the events $A$ and $B$ are indeed independent.






(b) Consider the same experiment as in (a), but with one important difference: Before the
experiment starts, we are told that the coin is unfair. It has either two heads, or two
tails, but we are not told which. Each possibility is equally likely.

To model this, we use a different probability measure $\mathbb{Q}$, which has

$$\mathbb{Q}(\{HH\}) = \tfrac{1}{2} = \mathbb{Q}(\{TT\}) \qquad \mathbb{Q}(\{HT\}) = 0 = \mathbb{Q}(\{TH\})$$

In this case $\mathbb{Q}(A \cap B) = \frac{1}{2}$, whereas $\mathbb{Q}(A)\mathbb{Q}(B) = \frac{1}{4}$. Thus $A$ and $B$ are not independent
under $\mathbb{Q}$.

□ 
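Both parts of Example 3.1.5 can be verified mechanically. In the sketch below the measures `P` and `Q` are those of the example, while the helper names `prob` and `independent` are assumptions made for illustration.

```python
from fractions import Fraction

def prob(event, measure):
    """Measure of an event as the sum of the point masses of its outcomes."""
    return sum((measure[w] for w in event), Fraction(0))

def independent(A, B, measure):
    """Multiplication law: measure(A and B) == measure(A) * measure(B)."""
    return prob(A & B, measure) == prob(A, measure) * prob(B, measure)

# Fair coin tossed twice (part (a)).
P = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

# Two-headed or two-tailed coin, each equally likely (part (b)).
Q = {"HH": Fraction(1, 2), "HT": Fraction(0), "TH": Fraction(0), "TT": Fraction(1, 2)}

B = {"HH", "HT"}   # first toss lands heads
A = {"HH", "TH"}   # second toss lands heads

independent_under_P = independent(A, B, P)
independent_under_Q = independent(A, B, Q)
```

The events $A$ and $B$ are the same sets in both cases; only the measure changes, and with it the answer.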

It's worth pointing out once more that it depends on the probability measure whether 
or not two events are independent, i.e. it is possible for events to be independent under 
one measure, but not under another. The notion of independence is therefore a probabilistic 
notion, which has no analogue in general measure theory. 

Remarks 3.1.6 Can an event be independent of itself, i.e. given an event $A$, can the events
$A, A$ be independent? Here we have to be a little careful. From the intuitive point of view,
the answer would seem to be no, since the information that the event $A$ has occurred will
certainly make us re-evaluate our estimation of the probability that $A$ has occurred! However,
if we look at the definition, $A$ and $A$ will be independent provided $\mathbb{P}(A \cap A) = \mathbb{P}(A)\mathbb{P}(A)$,
i.e. provided $\mathbb{P}(A) = \mathbb{P}(A)^2$. This can happen only if $\mathbb{P}(A)$ is either 0 or 1. That's not too
far removed from our intuition. If $\mathbb{P}(A) = 1$, for example, then $A$ happens almost surely, so
telling us that $A$ has happened does not really give us any information. We were practically
certain that it would anyway.

□ 

Exercise 3.1.7 A gambling game involves the rolling of a fair die followed by the flipping of 
a fair coin. 

(a) Set up a reasonable probability space to model this situation. 

(b) Let A be the event that the die lands on an even number, and let B be the event that 
the coin lands tails. Show that A and B are independent events. 

□ 

Exercise 3.1.8 A breathalyzer test for drinking and driving is 95% accurate, i.e. it gives
the correct result 95% of the time. John lives in a small town with 1000 inhabitants, about
50 of whom are drunk on any given evening. One evening, John is stopped by the police, and
tested. The test says that John is drunk. What is the probability that John is drunk?

□ 

Exercise 3.1.9 (Monty Hall Problem) 

On a TV game show there are three doors, 1, 2 and 3. Behind one door, there is a brand new
Porsche, but behind each of the remaining two doors there is a goat. You are a contestant
on this show, and you are asked to choose a door. You will win whatever is behind the door.
The probability of winning the Porsche is therefore $\frac{1}{3}$. Say you choose door 1. Now before
opening door 1, the game show host opens one of the other doors, and reveals a goat. (He
can always do this, because there are two goats. At least one of doors 2 and 3 must hide a
goat.) Suppose he opens door 2. He now asks you whether you would like to change your
mind and opt for door 3. Should you do it? What is the probability that the goat is behind
door 3, conditional on the above information?

□ 

3.2 Borel-Cantelli Lemmas

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\mathcal{A} = \{A_n : n \in \mathbb{N}\}$ be a countable set of events.
It may help to think of $\mathcal{A}$ as a sequence of events, $A_{n+1}$ following $A_n$.



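Proposition 3.2.1 (First Borel-Cantelli Lemma)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\{A_n : n \in \mathbb{N}\}$ be a sequence of events. If

$$\sum_n \mathbb{P}(A_n) < \infty$$

then

$$\mathbb{P}(A_n, \text{i.o.}) = 0$$



Proof: Let $B_n = \bigcup_{k=n}^{\infty} A_k$. Then $B_n \downarrow \limsup A_n = (A_n, \text{i.o.})$. Hence

$$\mathbb{P}(A_n, \text{i.o.}) \le \mathbb{P}(B_n) \le \sum_{k=n}^{\infty} \mathbb{P}(A_k)$$

for all $n \in \mathbb{N}$. Now as $n \to \infty$, the right-hand sum goes to zero, since $\sum_n \mathbb{P}(A_n)$ converges.
Hence $\mathbb{P}(A_n, \text{i.o.}) = 0$.

∎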



Proposition 3.2.2 (Second Borel-Cantelli Lemma)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\{A_n : n \in \mathbb{N}\}$ be a family of independent events.
If

$$\sum_n \mathbb{P}(A_n) = \infty$$

then

$$\mathbb{P}(A_n, \text{i.o.}) = 1$$



Proof: The proof depends on the fact that $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$, an inequality which is
easily proved using first-year calculus. Now clearly $(A_n, \text{i.o.}) = \limsup A_n = (\liminf A_n^c)^c =
(A_n^c, \text{ev.})^c$, and thus it suffices to prove that $\mathbb{P}(A_n^c, \text{ev.}) = 0$. But $(A_n^c, \text{ev.}) = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c$ by
definition, and so it suffices to show that $\mathbb{P}(\bigcap_{k=n}^{\infty} A_k^c) = 0$ for all $n$. Now by independence of
the $A_n$, and thus the $A_n^c$, we have

$$\mathbb{P}\Big(\bigcap_{k=n}^{n+m} A_k^c\Big) = \prod_{k=n}^{n+m} [1 - \mathbb{P}(A_k)] \le \prod_{k=n}^{n+m} e^{-\mathbb{P}(A_k)} = e^{-\sum_{k=n}^{n+m} \mathbb{P}(A_k)}$$

for all $m \in \mathbb{N}$. Now since $\sum_n \mathbb{P}(A_n)$ diverges, the power of $e$ on the right must tend to zero
as $m \to \infty$. Thus $\mathbb{P}(\bigcap_{k=n}^{\infty} A_k^c) = \lim_{m \to \infty} \mathbb{P}(\bigcap_{k=n}^{n+m} A_k^c) = 0$, as required.

∎

Remarks 3.2.3 The First Borel-Cantelli Lemma says that given events $A_n$, not necessarily
independent, if the sum of the probabilities $\mathbb{P}(A_n)$ converges, then $(A_n, \text{i.o.})$ is an event of
zero probability. The Second Borel-Cantelli Lemma says that if the $A_n$ are independent and
the sum of the probabilities $\mathbb{P}(A_n)$ diverges, then the event $(A_n, \text{i.o.})$ occurs almost surely,
i.e. with probability 1. Thus for independent events $A_n$, there is no middle road: $(A_n, \text{i.o.})$ is
either an event of probability 0 or an event of probability 1.

□ 
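The key estimate in the proof of the Second Borel-Cantelli Lemma can be watched at work on a concrete divergent series. The sketch below assumes independent events with $\mathbb{P}(A_k) = \frac{1}{k}$ (an illustrative choice, as are the helper names); it compares the exact probability that none of $A_n, \dots, A_{n+m}$ occurs with the exponential bound.

```python
import math
from fractions import Fraction

def prob_none_occur(p, n, m):
    """Product of (1 - p(k)) for k = n, ..., n+m: P(no A_k occurs), assuming independence."""
    result = Fraction(1)
    for k in range(n, n + m + 1):
        result *= 1 - p(k)
    return result

def exponential_bound(p, n, m):
    """Upper bound exp(-sum of p(k)) for k = n, ..., n+m, from 1 - x <= e^{-x}."""
    return math.exp(-sum(float(p(k)) for k in range(n, n + m + 1)))

# Hypothetical probabilities P(A_k) = 1/k; their sum diverges (harmonic series).
p = lambda k: Fraction(1, k)
n = 2

# (m, exact product, exponential bound) for growing m.
table = [(m, prob_none_occur(p, n, m), exponential_bound(p, n, m)) for m in (10, 100, 1000)]
```

For this choice the product telescopes to $\frac{n-1}{n+m}$, so it tends to zero as $m \to \infty$, as the lemma requires; the bound tends to zero as well, only more slowly.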

Exercise 3.2.4 It is sometimes asserted that if a monkey hit the keys of a typewriter at
random, it would eventually produce, in one continuous stream, the complete works of William
Shakespeare. Prove it.

□ 



3.3 Information in Random Variables 

We already introduced the notion of a $\sigma$-algebra generated by a family of sets. We can use
this to define the notion of a $\sigma$-algebra generated by a random variable.



Definition and Proposition 3.3.1 (a) Let $(S, \mathcal{S})$ be a measurable space, and suppose that
$\mathcal{X}$ is a collection of functions $\Omega \to S$. There is a smallest $\sigma$-algebra on $\Omega$, denoted by

$$\sigma(\mathcal{X})$$

such that all $X \in \mathcal{X}$ are $\sigma(\mathcal{X})/\mathcal{S}$-measurable. $\sigma(\mathcal{X})$ is called the $\sigma$-algebra
generated by $\mathcal{X}$.

We also write $\sigma(X_i : i \in I)$ for the $\sigma$-algebra generated by the family $\mathcal{X} = \{X_i : i \in I\}$.

(b) If $X$ is a measurable function, then

$$\sigma(X) = \{X^{-1}[T] : T \in \mathcal{S}\}$$



Proof: (a) Let

$$\mathcal{C} = \{X^{-1}[T] : X \in \mathcal{X}, T \in \mathcal{S}\}$$

Then $\mathcal{C}$ is a family of subsets of $\Omega$, and clearly $\sigma(\mathcal{X}) = \sigma(\mathcal{C})$. (We already know what is meant
by $\sigma(\mathcal{C})$, as $\mathcal{C}$ is a family of sets.)

(b) By (a), $\sigma(X)$ is the smallest $\sigma$-algebra which includes the family $\mathcal{C} = \{X^{-1}[T] : T \in
\mathcal{S}\}$. However, by Propn. 2.1.3, $\mathcal{C}$ is a $\sigma$-algebra, and thus $\mathcal{C} = \sigma(X)$.



In the probabilistic framework, $\sigma$-algebras play the role of carriers of information: Earlier,
we saw that if $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, then

• $\mathcal{F}$ is the set containing all those events for which it can be decided whether or not they
occurred.

• If $\mathcal{C}$ is a family of events, then $\sigma(\mathcal{C})$ is the set containing all those events for which it
can be decided whether or not they occurred, given that we can decide all the events in
$\mathcal{C}$.

Similarly:

For a random variable $X$ on a probability space $\Omega$, the $\sigma$-algebra $\sigma(X)$ can be interpreted
in two ways (which are two sides of the same coin):

(i) $\sigma(X)$ is the information carried by $X$: It is the set of all events that can be decided,
given that we know the value of $X$.

(ii) It is the smallest amount of information that we need in order to know the value of
$X$.



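Example 3.3.2 For example, consider the experiment of rolling a die, so that $\Omega = \{1, 2, \dots, 6\}$
and $\mathcal{F} = \mathcal{P}(\Omega)$. Let the random variable $X : \Omega \to \mathbb{R}$ be defined by

$$X(\omega) = \begin{cases} 0 & \text{if } \omega \text{ is even} \\ 1 & \text{if } \omega \text{ is odd} \end{cases}$$

It is easy to check that

$$\sigma(X) = \{\emptyset, \{1, 3, 5\}, \{2, 4, 6\}, \Omega\}$$

Let's consider our interpretations (i) and (ii) above: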

(i) If we know the value of X, all we know is whether the outcome of rolling the die is an 
even number or an odd number, i.e. all we can decide is whether {2,4,6} or {1,3,5} 
occurred (in addition to being able to decide the certain and impossible events). 

(ii) To know the value of X, all we need to know is whether the outcome of the die roll was 
even or odd. We do not need to know the exact outcome of rolling the die. 

□ 
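For a random variable on a finite sample space, $\sigma(X)$ can be listed explicitly using part (b) of Propn. 3.3.1: it suffices to pull back every subset of the (finite) range of $X$, since the preimage of a Borel set $B$ depends only on which values of $X$ lie in $B$. A minimal sketch, where the function name `generated_sigma_algebra` is an assumption made for this illustration:

```python
from itertools import combinations

def generated_sigma_algebra(Omega, X):
    """Return sigma(X) = {X^{-1}[T] : T a subset of the range of X} for finite Omega."""
    values = sorted({X(w) for w in Omega})
    events = set()
    for r in range(len(values) + 1):
        for T in combinations(values, r):
            # pull back the set T of values along X
            events.add(frozenset(w for w in Omega if X(w) in T))
    return events

# The die example: X is 0 on even outcomes and 1 on odd outcomes.
Omega = frozenset(range(1, 7))
X = lambda w: 0 if w % 2 == 0 else 1

sigma_X = generated_sigma_algebra(Omega, X)
```

Knowing $X$ lets us decide exactly the four events in `sigma_X`; an event such as $\{6\}$ is not among them, which matches interpretation (i).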

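Exercise 3.3.3 Suppose that $X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is a function, and that $\omega_1, \omega_2 \in \Omega$ are
two elements with the following property:

For all $F \in \mathcal{F}$ we have $\omega_1 \in F \Leftrightarrow \omega_2 \in F$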






Show that if $X$ is $\mathcal{F}$-measurable, then $X(\omega_1) = X(\omega_2)$. Thus if $\mathcal{F}$ cannot distinguish between
$\omega_1$ and $\omega_2$, neither can any $\mathcal{F}$-measurable random variable.
[Hint: Define $x := X(\omega_1)$ and consider $X^{-1}\{x\}$.]

□ 

If $X, Y$ are random variables such that $\sigma(Y) \subseteq \sigma(X)$, then the information needed to
determine the value of $Y$ is a subset of the information required to determine the value of $X$.
Hence, if we know the value of $X$, we should also know the value of $Y$. This suggests that $Y$
is a function of $X$. The following theorem makes this precise.



Theorem 3.3.4 (Doob-Dynkin Lemma)

Suppose that $X_i, Y : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ $(i = 1, \dots, n)$ are measurable. Then $Y$
is $\sigma(X_1, \dots, X_n)$-measurable iff there is a Borel function $h : \mathbb{R}^n \to \mathbb{R}$ such that $Y =
h(X_1, \dots, X_n)$.



Proof: ($\Leftarrow$): We first show that the map $X : \Omega \to \mathbb{R}^n : \omega \mapsto (X_1(\omega), \dots, X_n(\omega))$ is
$\sigma(X_1, \dots, X_n)/\mathcal{B}(\mathbb{R}^n)$-measurable. By Propn. 2.1.11 it suffices to check that $X^{-1} \prod_{i=1}^{n} (-\infty, c_i] \in
\sigma(X_1, \dots, X_n)$ for all $(c_1, \dots, c_n) \in \mathbb{R}^n$, because the family of these lower orthants generates
$\mathcal{B}(\mathbb{R}^n)$. But

$$X^{-1} \prod_{i=1}^{n} (-\infty, c_i] = \bigcap_{i=1}^{n} X_i^{-1}(-\infty, c_i]$$

so this is obvious. Now $h(X_1, \dots, X_n) = h \circ X$ is a composition of measurable functions, and
hence measurable.

($\Rightarrow$): First assume that $Y$ is simple, i.e. $Y = \sum_{j=1}^{m} y_j I_{A_j}$ for some family of mutually
disjoint sets $A_j$ (cf. Propn. 2.3.2). Since $Y$ is assumed to be $\sigma(X_1, \dots, X_n)$-measurable, we
see that each $A_j = Y^{-1}\{y_j\}$ belongs to $\sigma(X_1, \dots, X_n)$. Define $X = (X_1, \dots, X_n)$, as above.
Reasoning as in Propn. 3.3.1, it is easy to see that

$$A \in \sigma(X_1, \dots, X_n) \quad\text{iff}\quad A = X^{-1}B \text{ for some } B \in \mathcal{B}(\mathbb{R}^n)$$

and thus $A_j = X^{-1}B_j$ for some $B_j \in \mathcal{B}(\mathbb{R}^n)$. Now define

$$h := \sum_{j=1}^{m} y_j I_{B_j}$$

Then $h(X_1, \dots, X_n) = Y$, as required.

Now assume that $Y$ is an arbitrary $\sigma(X_1, \dots, X_n)$-measurable random variable. Choose
a sequence of simple $\sigma(X_1, \dots, X_n)$-measurable random variables $Y_k$ $(k \in \mathbb{N})$ such that $Y_k \to Y$ pointwise (cf. Propn.
2.3.3). Hence there exist Borel functions $f_k$ such that $Y_k = f_k(X_1, \dots, X_n)$. Let $M = \{x \in
\mathbb{R}^n : (f_k(x))_k \text{ converges}\}$. Then $M \in \mathcal{B}(\mathbb{R}^n)$ (e.g. $M = g^{-1}\{0\}$, where $g = \limsup_k f_k - \liminf_k f_k$).
Define $f : \mathbb{R}^n \to \mathbb{R}$ by

$$f(x) = \begin{cases} \lim_k f_k(x) & \text{if } x \in M \\ 0 & \text{else} \end{cases}$$

Then $f = \lim_k f_k I_M$ is a Borel function. Now

$$Y(\omega) = \lim_k Y_k(\omega) = \lim_k f_k(X_1(\omega), \dots, X_n(\omega))$$

which implies two things: (i) $(X_1(\omega), \dots, X_n(\omega)) \in M$, and (ii) $Y = f(X_1, \dots, X_n)$, as
required.

∎

3.4 Independence of $\sigma$-algebras and Random Variables

The intuitive idea about independence was the following: Two events are independent if the
information that one of the events has occurred does not tell us anything new about the
other, i.e. it does not lead us to revise our estimation of its probability. Now $\sigma$-algebras are
the carriers of information, and we would therefore like a definition of independence which
involves $\sigma$-algebras. We therefore define independence anew:

Definition 3.4.1 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

• Sub-$\sigma$-algebras $\mathcal{G}_1, \mathcal{G}_2, \dots$ of $\mathcal{F}$ are said to be independent if events in distinct $\mathcal{G}_n$
are independent, i.e. whenever $n_1, n_2, \dots, n_m \in \mathbb{N}$ are distinct positive integers and
$G_{n_1} \in \mathcal{G}_{n_1}, G_{n_2} \in \mathcal{G}_{n_2}, \dots, G_{n_m} \in \mathcal{G}_{n_m}$ are events, we have

$$\mathbb{P}(G_{n_1} \cap G_{n_2} \cap \dots \cap G_{n_m}) = \prod_{k \le m} \mathbb{P}(G_{n_k})$$

• A RV $X$ is independent of a $\sigma$-algebra $\mathcal{G}$ iff $\sigma(X), \mathcal{G}$ are independent $\sigma$-algebras.

• $X_1, X_2, \dots$ are independent RV's iff $\sigma(X_1), \sigma(X_2), \dots$ are independent $\sigma$-algebras.

The basic idea is that two $\sigma$-algebras are independent if there is no information about an
event in one of the $\sigma$-algebras that would lead us to revise our estimate of the probability of
any event in the other $\sigma$-algebra.

Example 3.4.2 Suppose that $A, B$ are events in some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Then $\mathcal{A} = \{\Omega, A, A^c, \varnothing\}$ and $\mathcal{B} = \{\Omega, B, B^c, \varnothing\}$ are the $\sigma$-algebras of events that can be decided by knowledge of $A$, $B$ respectively. It is easy to show that $A, B$ are independent events if and only if $\mathcal{A}, \mathcal{B}$ are independent $\sigma$-algebras. For example, $\mathbb{P}(A) = \mathbb{P}(A \cap B) + \mathbb{P}(A \cap B^c)$, and thus $\mathbb{P}(A \cap B^c) = \mathbb{P}(A)[1 - \mathbb{P}(B)] = \mathbb{P}(A)\mathbb{P}(B^c)$, by independence of $A, B$. It follows that $A, B^c$ are independent if $A, B$ are. The other combinations of events are similarly proven independent.

□ 
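The claim of Example 3.4.2 can be checked by brute force on a small finite probability space. The sketch below is only an illustration: the two-coin-toss space and the names `omega`, `prob`, `sigma_algebra_of` and `independent` are assumptions made for this example and do not appear in the notes.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair coin tosses, uniform measure.
omega = frozenset(product("HT", repeat=2))

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event), len(omega))

def sigma_algebra_of(event):
    """The sigma-algebra {Omega, A, A^c, empty set} decided by one event A."""
    return [omega, event, omega - event, frozenset()]

def independent(family_a, family_b):
    """True iff P(G1 & G2) = P(G1) * P(G2) for all G1 in family_a, G2 in family_b."""
    return all(prob(g1 & g2) == prob(g1) * prob(g2)
               for g1 in family_a for g2 in family_b)

A = frozenset(w for w in omega if w[0] == "H")  # first toss is heads
B = frozenset(w for w in omega if w[1] == "H")  # second toss is heads
C = frozenset(w for w in omega if "H" in w)     # at least one head

print(independent(sigma_algebra_of(A), sigma_algebra_of(B)))  # A, B independent
print(independent(sigma_algebra_of(A), sigma_algebra_of(C)))  # A, C dependent
```

Exact rational arithmetic (`Fraction`) is used so that the product rule is tested as an equality rather than up to rounding.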

We return now to the notion of independence. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. Recall that we have made the following definitions:

• Events $F_1,\dots,F_n \in \mathcal{F}$ are independent iff $\mathbb{P}(F_1 \cap \dots \cap F_n) = \prod_{k=1}^{n} \mathbb{P}(F_k)$.

• Sub-$\sigma$-algebras $\mathcal{G}_1,\dots,\mathcal{G}_n$ are independent iff whenever $G_k \in \mathcal{G}_k$ for $k = 1,\dots,n$ then $G_1,\dots,G_n$ are independent events.

• A random variable $X$ is independent of a $\sigma$-algebra $\mathcal{G}$ iff $\sigma(X),\mathcal{G}$ are independent $\sigma$-algebras.






Other variations (e.g. what it means for random variables $X_1,\dots,X_n$ to be independent) should be obvious.

If two $\pi$-systems are independent, so are the $\sigma$-algebras generated by those $\pi$-systems:

Theorem 3.4.3 Let $\{\mathcal{C}_t\}_{t \in T}$ be a collection of independent $\pi$-systems on $(\Omega,\mathcal{F},\mathbb{P})$. Then $\{\sigma(\mathcal{C}_t)\}_{t \in T}$ is a collection of independent $\sigma$-algebras.



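Proof: We must show that if $t_1,\dots,t_n \in T$ are distinct, then $\sigma(\mathcal{C}_{t_1}),\dots,\sigma(\mathcal{C}_{t_n})$ are independent. We proceed by recursion. Fix $t_1,\dots,t_n \in T$, define $\mathcal{F}_{t_k} := \sigma(\mathcal{C}_{t_k})$, and also fix $C_{t_2} \in \mathcal{C}_{t_2},\dots,C_{t_n} \in \mathcal{C}_{t_n}$. Let

\[ \mathcal{D} := \{F \in \mathcal{F}_{t_1} : \mathbb{P}(F \cap C_{t_2} \cap \dots \cap C_{t_n}) = \mathbb{P}F \cdot \mathbb{P}C_{t_2} \cdot \ldots \cdot \mathbb{P}C_{t_n}\} \]

By assumption, $\mathcal{C}_{t_1} \subseteq \mathcal{D}$. Using the continuity of measure, it is straightforward to check that $\mathcal{D}$ is a $\lambda$-system. Thus by Thm. 1.5.4 we have $\mathcal{D} = \mathcal{F}_{t_1}$ for every selection of $C_{t_k} \in \mathcal{C}_{t_k}$, $k = 2,3,\dots,n$, and hence the families $\mathcal{F}_{t_1},\mathcal{C}_{t_2},\mathcal{C}_{t_3},\dots,\mathcal{C}_{t_n}$ are independent. Repeat: Fix $F_{t_1} \in \mathcal{F}_{t_1}$ and $C_{t_3} \in \mathcal{C}_{t_3},\dots,C_{t_n} \in \mathcal{C}_{t_n}$. Redefine

\[ \mathcal{D} := \{F \in \mathcal{F}_{t_2} : \mathbb{P}(F_{t_1} \cap F \cap C_{t_3} \cap \dots \cap C_{t_n}) = \mathbb{P}F_{t_1} \cdot \mathbb{P}F \cdot \mathbb{P}C_{t_3} \cdot \ldots \cdot \mathbb{P}C_{t_n}\} \]

Again, $\mathcal{D}$ is a $\lambda$-system containing $\mathcal{C}_{t_2}$, and hence by Thm. 1.5.4 $\mathcal{D} = \mathcal{F}_{t_2}$. From this it follows that $\mathcal{F}_{t_1},\mathcal{F}_{t_2},\mathcal{C}_{t_3},\dots,\mathcal{C}_{t_n}$ are independent. Repeat the construction $n - 2$ more times to deduce that $\mathcal{F}_{t_1},\dots,\mathcal{F}_{t_n}$ are independent.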



If you're familiar with the elementary definition of independence for random variables, given in introductory courses on probability and statistics, you will want to know the following:

Exercise 3.4.4 Random variables $X, Y$ are said to be independent in the elementary sense iff

\[ \mathbb{P}(X \le x, Y \le y) = \mathbb{P}(X \le x) \cdot \mathbb{P}(Y \le y) \quad \text{for all } x, y \in \mathbb{R} \]

Prove that random variables are independent iff they are independent in the elementary sense.
[Hint: First show that the set $\mathcal{X} := \{\{X \le x\} : x \in \mathbb{R}\}$ is a $\pi$-system which generates $\sigma(X)$.]

□ 

The following result is very easy to prove within the measure-theoretic framework, but very difficult to prove using the elementary definition of independence:

Theorem 3.4.5 Suppose that $X_1,\dots,X_{n+m}$ are independent random variables, and that $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^m \to \mathbb{R}$ are Borel functions. Then $Y = f(X_1,\dots,X_n)$ and $Z = g(X_{n+1},\dots,X_{n+m})$ are independent.

Proof: We see that $\sigma(X_1,\dots,X_n)$ and $\sigma(X_{n+1},\dots,X_{n+m})$ are independent $\sigma$-algebras. Now $Y$ is $\sigma(X_1,\dots,X_n)$-measurable, i.e. $\sigma(Y) \subseteq \sigma(X_1,\dots,X_n)$. Similarly $Z$ is $\sigma(X_{n+1},\dots,X_{n+m})$-measurable, i.e. $\sigma(Z) \subseteq \sigma(X_{n+1},\dots,X_{n+m})$. Thus $\sigma(Y)$ and $\sigma(Z)$ are independent.

⊣






Chapter 4 

Integration and Expectation 



4.1 The Integral: Definition and Basic Properties 



The aim of this section is to define the integral $\int f\,d\mu$ of a measurable function $f$ w.r.t. a measure $\mu$. Why do we want this? Because

Expectation = Integration

Throughout this section let $(S,\mathcal{F},\mu)$ be a measure space, and let $m\mathcal{F}$ be the set of all measurable functions from $(S,\mathcal{F})$ to $\mathbb{R}$. We will define a (partial) linear functional, also denoted by $\mu$, or by $\int \cdot\,d\mu$, from $m\mathcal{F}$ to $\mathbb{R}$, i.e.

\[ \mu : m\mathcal{F} \to \mathbb{R}, \qquad f \mapsto \mu f = \int f\,d\mu \]

The quantity $\mu f = \int f\,d\mu$ need not exist for every measurable function $f$. If it does exist, we say that $f$ is integrable.

For the map $\mu : m\mathcal{F} \to \mathbb{R}$ to be an integral, we would like it to satisfy the following properties:

I. $\int I_A\,d\mu = \mu A$, i.e. $\mu I_A = \mu A$.

II. (Linearity) $\int \alpha f + \beta g\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$.

III. (Monotonicity) If $f \le g$ then $\int f\,d\mu \le \int g\,d\mu$.

IV. (Continuity) Suppose that $f_n \to f$. Then $\int f_n\,d\mu \to \int f\,d\mu$.
[Actually, we won't quite get this property, but a weaker one.]

Note that (I.) states that the integral $\mu$ is, in some sense, an extension of the measure $\mu$: Every measurable set can be identified with a measurable function (the set $A$ is identified with the indicator function $I_A$). The integral $\int f\,d\mu = \mu f$ can be thought of as extending the measure $\mu$ from sets to functions.

The definition of the integral proceeds in three steps:

Step 1. Define the integral $\mu$ on the set $(s\mathcal{F})^+$ of non-negative simple functions.








Step 2. Extend the definition to the set $(m\mathcal{F})^+$ of all non-negative measurable functions.

Step 3. Finally extend the definition to the set $m\mathcal{F}$ of measurable functions.

If $\varphi$ is a non-negative simple function, there is only one way to define the integral to be consistent with (I.) and (II.): If $\varphi = \sum_{k=1}^{n} a_k I_{A_k}$, then define

\[ \int \varphi\,d\mu = \sum_{k=1}^{n} a_k\,\mu A_k \]

Some things need checking: 

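Proposition 4.1.1 (a) The definition of $\mu\varphi$ doesn't depend on the representation of $\varphi$ as a linear combination of indicators, i.e. if $\varphi = \sum_k a_k I_{A_k} = \sum_j b_j I_{B_j}$, then $\sum_k a_k\,\mu A_k = \sum_j b_j\,\mu B_j$.

(b) $\int I_A\,d\mu = \mu A$, i.e. $\mu I_A = \mu A$.

(c) If $\varphi, \psi \in (s\mathcal{F})^+$ and $\alpha, \beta \ge 0$, then $\int \alpha\varphi + \beta\psi\,d\mu = \alpha \int \varphi\,d\mu + \beta \int \psi\,d\mu$, i.e. $\mu(\alpha\varphi + \beta\psi) = \alpha\,\mu\varphi + \beta\,\mu\psi$.

(d) If $\varphi \le \psi \in (s\mathcal{F})^+$, then $\int \varphi\,d\mu \le \int \psi\,d\mu$, i.e. $\mu\varphi \le \mu\psi$.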

Proof: We may assume that the $a_k$ are all distinct from each other, and that the $b_j$ are all distinct from each other. Thus $A_k, B_j \in \sigma(\varphi)$, the $\sigma$-algebra generated by $\varphi$. A little thought shows that there is a representation $\varphi = \sum_m c_m I_{C_m}$ of $\varphi$ such that $(C_m)_m$ forms a partition of $S$, and such that each $A_k$ and $B_j$ is a union of some $C_m$'s: just let the $C_m$'s be the blocks of the partition that generates $\sigma(\varphi)$. In particular, for each $k, m$, either $A_k \cap C_m = \varnothing$, or $C_m \subseteq A_k$. A similar statement holds for the $B_j$. Also $c_m = \sum_k \{a_k : C_m \subseteq A_k\}$.

(a) We have

\[ \sum_k a_k\,\mu A_k = \sum_k a_k \sum_m \mu(A_k \cap C_m) = \sum_m \sum_k a_k\,\mu(A_k \cap C_m) = \sum_m \sum_k \{a_k\,\mu C_m : C_m \subseteq A_k\} = \sum_m c_m\,\mu C_m \]

By the same computation, $\sum_j b_j\,\mu B_j = \sum_m c_m\,\mu C_m$ as well, so the two sums coincide.

(b) is obvious.

(c) Suppose that $\varphi = \sum_k a_k I_{A_k}$, $\psi = \sum_j b_j I_{B_j}$, where $(A_k)_k$, $(B_j)_j$ are partitions of $S$. Then $\varphi + \psi = \sum_{k,j} (a_k + b_j) I_{A_k \cap B_j}$ and hence

\[ \mu(\varphi + \psi) = \sum_{k,j} (a_k + b_j)\,\mu(A_k \cap B_j) = \sum_k a_k\,\mu A_k + \sum_j b_j\,\mu B_j \]

(d) is obvious.

⊣
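Part (a) of Proposition 4.1.1 can be illustrated on a finite measure space, where a simple function is just a finite list of (coefficient, set) pairs. This is a minimal sketch under assumptions: the point masses in `weights` and the helper names `measure`, `integral` and `evaluate` are hypothetical and chosen only for this example.

```python
from fractions import Fraction

# Hypothetical finite measure space S = {0,...,5}: mu is given by point masses.
weights = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1),
           3: Fraction(0), 4: Fraction(3), 5: Fraction(1, 3)}

def measure(A):
    """mu(A) for a subset A of S."""
    return sum((weights[s] for s in A), Fraction(0))

def integral(rep):
    """Integral of sum_k a_k I_{A_k}, where rep is a list of pairs (a_k, A_k)."""
    return sum((a * measure(A) for a, A in rep), Fraction(0))

def evaluate(rep, s):
    """Value at the point s of the simple function represented by rep."""
    return sum((a for a, A in rep if s in A), Fraction(0))

# Two different representations of the same simple function:
rep1 = [(Fraction(2), {0, 1, 2}), (Fraction(5), {2, 3})]         # overlapping sets
rep2 = [(Fraction(2), {0, 1}), (Fraction(7), {2}), (Fraction(5), {3})]  # disjoint sets

same_function = all(evaluate(rep1, s) == evaluate(rep2, s) for s in weights)
print(same_function, integral(rep1), integral(rep2))
```

Both representations describe the same function, and the two sums $\sum_k a_k\,\mu A_k$ agree, as the proposition asserts.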

If $f$ is a non-negative measurable function, then (III.) requires that we must have $\int f\,d\mu \ge \int \varphi\,d\mu$ whenever $\varphi$ is simple, with $f \ge \varphi$. We also know that there is a sequence $\varphi_n$ of simple non-negative functions such that $\varphi_n \uparrow f$. (IV.) dictates that we should then have $\int \varphi_n\,d\mu \to \int f\,d\mu$, and (III.) that $\lim_n \int \varphi_n\,d\mu = \sup_n \int \varphi_n\,d\mu$. The most parsimonious choice, therefore, is to define:






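Definition 4.1.2 For $f \in (m\mathcal{F})^+$, define

\[ \int f\,d\mu = \sup\left\{ \int \varphi\,d\mu : f \ge \varphi \in (s\mathcal{F})^+ \right\} \]

Note that $\mu f$ may be equal to $+\infty$.

Exercise 4.1.3 (a) If $\psi = \sum_k a_k I_{A_k}$ is non-negative simple, it is also non-negative measurable, and thus we now have two definitions of $\mu\psi$, namely

\[ \mu\psi = \sum_k a_k\,\mu A_k \qquad \text{and} \qquad \mu\psi = \sup\{\mu\varphi : \psi \ge \varphi \in (s\mathcal{F})^+\} \]

Show that these two values of $\mu\psi$ coincide.
(b) Verify (III.) for non-negative measurable functions, i.e. show that if $f \le g \in (m\mathcal{F})^+$, then $\mu f \le \mu g$.

□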

Proving that the integral is still linear, i.e. that (II.) holds, is much more difficult, and requires a version of (IV.). In fact, a weak version of (IV.) forms the foundation for the whole edifice of integration theory:



Theorem 4.1.4 (Monotone Convergence Theorem, MCT)
Suppose that $f_n, f \in (m\mathcal{F})^+$ are such that $f_n \uparrow f$. Then $\mu f_n \uparrow \mu f$, i.e.

\[ \uparrow\!\lim_n \int f_n\,d\mu = \int \uparrow\!\lim_n f_n\,d\mu \]

Proof: It is easy to see that $(\mu f_n)_n$ is an increasing sequence, and that each $\mu f_n \le \mu f$, so that $\lim_n \mu f_n$ exists (in the extended reals) and $\lim_n \mu f_n \le \mu f$.

Let $f \ge \varphi \in (s\mathcal{F})^+$, and suppose $\varphi = \sum_k a_k I_{A_k}$, where the $A_k$ are disjoint, and each $a_k > 0$. For $\varepsilon > 0$, define

\[ \varphi_n = \sum_k (1 - \varepsilon)\,a_k\,I_{A_k \cap \{f_n > (1-\varepsilon)a_k\}} \]

Then $\varphi_n$ is a non-negative simple measurable function with $\varphi_n \le f_n$. Hence

\[ \mu f_n \ge \mu\varphi_n = (1 - \varepsilon) \sum_k a_k\,\mu(A_k \cap \{f_n > (1-\varepsilon)a_k\}) \]

Note also that

\[ A_k \cap \{f_n > (1-\varepsilon)a_k\} \uparrow A_k \]

for if $s \in A_k$, then $a_k = \varphi(s) \le f(s) = \lim_n f_n(s)$, so that $f_n(s) > (1-\varepsilon)a_k$ if $n$ is sufficiently large, and thus $s \in A_k \cap \{f_n > (1-\varepsilon)a_k\}$ if $n$ is sufficiently large. By continuity properties of measures,

\[ \mu(A_k \cap \{f_n > (1-\varepsilon)a_k\}) \uparrow \mu A_k \quad \text{as } n \to \infty \]

which in turn yields

\[ \mu\varphi_n \uparrow (1 - \varepsilon) \sum_k a_k\,\mu A_k = (1 - \varepsilon)\,\mu\varphi \quad \text{as } n \to \infty \]






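Now $\mu f_n \ge \mu\varphi_n$ for each $n \in \mathbb{N}$, and thus

\[ \lim_n \mu f_n \ge (1 - \varepsilon)\,\mu\varphi \]

This is true for any non-negative simple $\varphi \le f$ and any $\varepsilon > 0$. Taking the supremum over those $\varphi$, we see that

\[ \lim_n \mu f_n \ge (1 - \varepsilon) \sup\{\mu\varphi : f \ge \varphi \in (s\mathcal{F})^+\} = (1 - \varepsilon)\,\mu f \]

Letting $\varepsilon \to 0$, we conclude that $\lim_n \mu f_n \ge \mu f$.

⊣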
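The Monotone Convergence Theorem can be watched in action on a concrete example. The sketch below is an illustration under stated assumptions, not part of the notes: it takes $f(x) = x$ on $[0,1]$ with Lebesgue measure and the dyadic staircase functions $\varphi_n = 2^{-n}\lfloor 2^n f \rfloor = \sum_{k=1}^{2^n} 2^{-n} I_{\{f \ge k 2^{-n}\}}$, which are simple and increase to $f$. The names `level_set_measure` and `integral_phi` are hypothetical.

```python
from fractions import Fraction

def level_set_measure(c):
    """Lebesgue measure of {x in [0,1] : x >= c}, for the example f(x) = x."""
    return max(Fraction(0), 1 - c)

def integral_phi(n):
    """Integral of the n-th dyadic staircase phi_n below f(x) = x on [0,1].

    phi_n is the sum over k = 1..2^n of 2^-n times the indicator of
    {f >= k * 2^-n}, so its integral is a finite sum of level-set measures.
    """
    step = Fraction(1, 2 ** n)
    return sum((step * level_set_measure(k * step)
                for k in range(1, 2 ** n + 1)), Fraction(0))

values = [integral_phi(n) for n in range(1, 9)]
for n, v in enumerate(values, start=1):
    print(n, v, float(v))
```

The printed integrals increase strictly and stay below $1/2$, the integral of $f$; in this example they equal $1/2 - 2^{-(n+1)}$ exactly.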



In Propn. 4.1.1(b), Exercise 4.1.3(b) and Thm. 4.1.4 we have seen that (I.), (III.) and a weak version of (IV.) hold. We have also verified (II.) for non-negative simple functions (cf. Propn. 4.1.1(c)). Now we can verify that (II.) holds for non-negative measurable functions:

Proposition 4.1.5 If $f, g \in (m\mathcal{F})^+$ and if $\alpha, \beta \ge 0$, then $\mu(\alpha f + \beta g) = \alpha\,\mu f + \beta\,\mu g$.



Proof: Choose sequences $(\varphi_n)_n$, $(\psi_n)_n$ of non-negative simple functions such that $\varphi_n \uparrow f$, $\psi_n \uparrow g$. Then each $\alpha\varphi_n + \beta\psi_n$ is non-negative simple, and $(\alpha\varphi_n + \beta\psi_n) \uparrow (\alpha f + \beta g)$. Since (II.) holds for simple functions, and by the Monotone Convergence Theorem, we see that

\[ \mu(\alpha f + \beta g) = \lim_n \mu(\alpha\varphi_n + \beta\psi_n) = \alpha \lim_n \mu\varphi_n + \beta \lim_n \mu\psi_n = \alpha\,\mu f + \beta\,\mu g \]



It remains to define the integral for arbitrary measurable functions. Recall that if $f \in m\mathcal{F}$, then $f = f^+ - f^-$, where $f^+ = f \vee 0$, $f^- = -(f \wedge 0) = (-f) \vee 0$. Since $f^+, f^- \in (m\mathcal{F})^+$, the integrals $\mu f^+, \mu f^-$ have already been defined. If we want to preserve linearity, we therefore must define $\mu f$ by

\[ \int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu \]

However, here we face a problem: If both $\mu f^+, \mu f^-$ are equal to $+\infty$, we have $\mu f = \infty - \infty$, an indeterminate form.

Definition and Proposition 4.1.6 A function $f \in m\mathcal{F}$ is said to be $\mu$-integrable iff $\mu|f| < \infty$. The class of all $\mu$-integrable functions is denoted by $\mathcal{L}^1(S,\mathcal{F},\mu)$.
If $f \in \mathcal{L}^1(S,\mathcal{F},\mu)$, we define

\[ \mu f := \mu f^+ - \mu f^- \]

Then $f$ is integrable iff $\mu f^+, \mu f^- < \infty$.

Proof: Note that $|f| = f^+ + f^-$, so $\mu|f|$ is finite iff both $\mu f^+, \mu f^-$ are finite.

⊣

Definition 4.1.7 If $f$ is an integrable function and $A$ is a measurable set, we define

\[ \int_A f\,d\mu := \int f I_A\,d\mu =: \mu(f; A) \]

to be the integral of $f$ over the set $A$.






Remarks 4.1.8 Later, we will prove the following important fact: If the Riemann integral $\int_a^b f(x)\,dx$ of a function $f : \mathbb{R} \to \mathbb{R}$ exists, then

\[ \int_a^b f(x)\,dx = \int_{[a,b]} f\,d\lambda \]

where $\lambda$ is Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. If the Riemann integral of a function exists, then so does the Lebesgue integral, and the two integrals coincide. This is obvious if $f$ is a simple function, as you can easily check, but the proof for general $f$ is deferred to a later subsection. Note that the Lebesgue integral may exist even when the Riemann integral does not.

□



Obvious, but often useful, are the following facts:

Proposition 4.1.9 (a) If $f$ is measurable, then $f$ is integrable iff $|f|$ is integrable.

(b) If $f, g$ are measurable, $g$ is integrable and $|f| \le g$, then $f$ is integrable as well.

(c) If $f$ is integrable, then $\mu\{f = \pm\infty\} = 0$, i.e. $f$ is finite $\mu$-a.e.

(d) If $f$ is integrable, then $|\int f\,d\mu| \le \int |f|\,d\mu$.

Proof: (a) is obvious. (b) follows from the fact that $\mu|f| \le \mu g < \infty$ (because we have (III.), monotonicity, for non-negative measurable functions).

(c) Let $A = \{s \in S : |f(s)| = \infty\}$. Then $n I_A \le |f|$ for all $n \in \mathbb{N}$, and hence $n\,\mu A \le \mu|f|$. Letting $n \to \infty$, we see that we must have $\mu|f| = +\infty$ if $\mu A > 0$. Since $f$ is integrable, it follows that $\mu A = 0$.

(d) follows because $|\mu f| \le |\mu f^+| + |\mu f^-| = \mu|f|$.
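On a finite measure space the decomposition $f = f^+ - f^-$ and the inequality in Propn. 4.1.9(d) reduce to finite sums, which makes them easy to check by hand or by machine. The following sketch assumes a hypothetical three-point space with weights `weights` and a hypothetical function `f`; none of these names come from the notes.

```python
from fractions import Fraction

# Hypothetical measure space: three points with the given masses.
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
# Hypothetical signed function on that space.
f = {"a": Fraction(3), "b": Fraction(-4), "c": Fraction(1)}

def integrate_nonneg(h):
    """Integral of a non-negative function on the finite space (a finite sum)."""
    return sum((h[s] * weights[s] for s in weights), Fraction(0))

f_plus = {s: max(v, Fraction(0)) for s, v in f.items()}    # f^+ = f v 0
f_minus = {s: max(-v, Fraction(0)) for s, v in f.items()}  # f^- = (-f) v 0
abs_f = {s: abs(v) for s, v in f.items()}                  # |f| = f^+ + f^-

mu_f = integrate_nonneg(f_plus) - integrate_nonneg(f_minus)
print(mu_f, integrate_nonneg(abs_f))
```

Here $\mu f^+ = 5/3$ and $\mu f^- = 4/3$, so $\mu f = 1/3$ while $\mu|f| = 3$, consistent with $|\mu f| \le \mu|f|$.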



Remarks 4.1.10 Above, we have defined the integral $\mu f$ only for $f \in \mathcal{L}^1(S,\mathcal{F},\mu)$, with the result that $-\infty < \mu f < \infty$, i.e. $\mu f$ is a finite number. Before, we defined the integral for arbitrary $g \in (m\mathcal{F})^+$, but might then have $\mu g = +\infty$. The restriction to $\mathcal{L}^1$ is to prevent having to deal with the indeterminate form $\infty - \infty$. However, $\infty - c$ and $c - \infty$ are perfectly fine if $c \ne \infty$. So we sometimes can define the integral of a measurable function $f$ in an extended sense: If $\mu f^+ = \infty$, but $\mu f^- = c < \infty$, then we say that $\mu f = \infty$, for example. Nevertheless, such an $f$ is not integrable.

□



Exercise 4.1.11 The decomposition $f = f^+ - f^-$ is but one of many ways that $f$ can be decomposed as a difference of non-negative measurable functions. Show that if $f = g - h$ is a difference of non-negative measurable functions, then $\mu f = \mu g - \mu h$. Thus the definition of the integral of $f$ is independent of the representation of $f$ as a difference of non-negative measurable functions.
[Hint: Apply Propn. 4.1.5 to $f^+ + h = g + f^-$.]

□






Looking at our wish list of properties, i.e. (I.)-(IV.), we see that (I.) holds automatically. (III.) (monotonicity) is easy: If $f \le g$, then $f^+ \le g^+$ and $f^- \ge g^-$, so $\mu f^+ \le \mu g^+$ and $\mu f^- \ge \mu g^-$ (because (III.) holds for non-negative measurable functions, cf. Exercise 4.1.3(b)), and hence $\mu f \le \mu g$.

We finish this subsection by dealing with (II.) (linearity), and leave (IV.) (continuity) to the next section.



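Theorem 4.1.12 If $f, g \in \mathcal{L}^1(S,\mathcal{F},\mu)$, and $\alpha, \beta \in \mathbb{R}$, then

\[ \mu(\alpha f + \beta g) = \alpha\,\mu f + \beta\,\mu g \]

Proof: It suffices to prove that $\mu(f + g) = \mu f + \mu g$ and that $\mu(\alpha f) = \alpha\,\mu f$ (for $f, g \in \mathcal{L}^1$ and $\alpha \in \mathbb{R}$). Now

\[ f + g = (f^+ + g^+) - (f^- + g^-) \]

is a representation of $f + g$ as a difference of non-negative measurable functions. By Exercise 4.1.11 it follows that $\mu(f + g) = \mu(f^+ + g^+) - \mu(f^- + g^-)$. Propn. 4.1.5 implies that

\[ \mu(f + g) = (\mu f^+ + \mu g^+) - (\mu f^- + \mu g^-) = \mu f + \mu g \]

Similarly, an application of Propn. 4.1.5 and Exercise 4.1.11 to $\alpha f = \alpha f^+ - \alpha f^-$ (if $\alpha \ge 0$), or $\alpha f = (-\alpha) f^- - (-\alpha) f^+$ (if $\alpha < 0$) yields the conclusion that $\mu(\alpha f) = \alpha\,\mu f$.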



Exercise 4.1.13 Show that $\mathcal{L}^1(S,\mathcal{F},\mu)$ is a vector space. Also give an example to show that it may not be closed under multiplication.

□ 

4.2 Lebesgue's Dominated Convergence Theorem 

The following proposition serves as a stepping stone in the proof of the Dominated Convergence Theorem, but is also very useful in other situations.

Proposition 4.2.1 (a) FATOU'S LEMMA: If $f_n \in (m\mathcal{F})^+$ for $n \in \mathbb{N}$, then

\[ \mu(\liminf_n f_n) \le \liminf_n \mu f_n \]

(b) REVERSE FATOU LEMMA: Suppose that $f_n \in (m\mathcal{F})^+$ for $n \in \mathbb{N}$, and that there exists a $g \in \mathcal{L}^1(S,\mathcal{F},\mu)$ such that each $f_n \le g$. Then

\[ \limsup_n \mu f_n \le \mu(\limsup_n f_n) \]

Proof: (a) Let $f = \liminf_n f_n$, and define $g_n = \inf_{m \ge n} f_m$. Then $g_n \uparrow f$, and so the Monotone Convergence Theorem implies that $\mu g_n \uparrow \mu f$. Moreover, $\mu g_n \le \inf_{m \ge n} \mu f_m$ (by monotonicity, (III.)), and so $\mu f = \lim_n \mu g_n \le \lim_n \inf_{m \ge n} \mu f_m = \liminf_n \mu f_n$.

⊣

Exercise 4.2.2 Prove the Reverse Fatou Lemma by applying Fatou's Lemma to the sequence $g - f_n$. Why do we require that $g \in \mathcal{L}^1$? Cancellation!

□

Remarks 4.2.3 Under suitable conditions, we see that we have

\[ \mu \liminf_n f_n \le \liminf_n \mu f_n \le \limsup_n \mu f_n \le \mu \limsup_n f_n \]

This provides a useful mnemonic: The terms with the limits on the outside (of the integral) are on the inside (of the string of inequalities).

The mnemonic "Terms with limits on the inside are on the outside" also works.

□ 



Theorem 4.2.4 (Dominated Convergence Theorem, DCT)

Suppose that $f_1, f_2, f_3, \dots$ is a sequence of measurable functions on $(S,\mathcal{F},\mu)$ such that

(i) $\lim_n f_n(s)$ exists for all $s \in S$;

(ii) There is a $g \in \mathcal{L}^1(S,\mathcal{F},\mu)$ such that $|f_n| \le g$ for all $n \in \mathbb{N}$.

Then the function $f = \lim_n f_n$ is in $\mathcal{L}^1(S,\mathcal{F},\mu)$, and

\[ \mu f = \lim_n \mu f_n \quad \text{i.e.} \quad \int \lim_n f_n\,d\mu = \lim_n \int f_n\,d\mu \]



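Proof: Since $|f_n| \le g$ for all $n$, we also have $|f| \le g$, so that $f \in \mathcal{L}^1(S,\mathcal{F},\mu)$ by Propn. 4.1.9(b). Moreover, the functions $g \pm f_n$ are non-negative measurable functions, and thus by Fatou's lemma, we see that

\[ \mu g + \liminf_n (\pm\mu f_n) = \liminf_n \mu(g \pm f_n) \ge \mu(\liminf_n (g \pm f_n)) = \mu(g \pm f) = \mu g \pm \mu f \]

Subtracting $\mu g < \infty$ from both sides, we see that $\liminf_n \mu f_n \ge \mu f$ and that $\liminf_n (-\mu f_n) \ge -\mu f$, and thus that $\limsup_n \mu f_n \le \mu f$. Combining, we obtain

\[ \mu f \le \liminf_n \mu f_n \le \limsup_n \mu f_n \le \mu f \]

so that $\lim_n \mu f_n$ exists and equals $\mu f$.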

Exercise 4.2.5 (1.) Let $f_n = \frac{1}{n} I_{[0,n]}$ for $n \in \mathbb{N}$.

(a) Show that $f_n \to 0$ as $n \to +\infty$.

(b) Show that $\int f_n\,d\lambda = 1$ for all $n \in \mathbb{N}$.

(c) Why does this not contradict

(i) the Monotone Convergence Theorem?

(ii) Fatou's Lemma?

(iii) the Lebesgue Dominated Convergence Theorem?

(2.) Let $f_n = n I_{(0,1/n]}$ for $n \in \mathbb{N}$.

(a) Find the function $\lim_{n \to +\infty} f_n$.

(b) Show that $\lim_n \int f_n\,d\lambda \ne \int \lim_n f_n\,d\lambda$.

(c) Why does this not contradict the Lebesgue Dominated Convergence Theorem?

□ 
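The two sequences in Exercise 4.2.5 can be explored numerically before attempting the exercise. The sketch below only tabulates the quantities involved and does not answer part (c); it assumes the reading $f_n = \frac{1}{n} I_{[0,n]}$ and $f_n = n I_{(0,1/n]}$, and the helper names `integral_of_scaled_indicator`, `f1` and `f2` are hypothetical.

```python
from fractions import Fraction

def integral_of_scaled_indicator(height, left, right):
    """Lebesgue integral of height * I_J for an interval J with endpoints
    left <= right: the height times the length of the interval."""
    return height * (right - left)

def f1(n, x):
    """f_n = (1/n) * I_[0,n] evaluated at x."""
    return Fraction(1, n) if 0 <= x <= n else Fraction(0)

def f2(n, x):
    """f_n = n * I_(0,1/n] evaluated at x."""
    return Fraction(n) if 0 < x <= Fraction(1, n) else Fraction(0)

integrals1 = [integral_of_scaled_indicator(Fraction(1, n), 0, n) for n in range(1, 50)]
integrals2 = [integral_of_scaled_indicator(Fraction(n), 0, Fraction(1, n)) for n in range(1, 50)]

print(set(integrals1), set(integrals2))        # every integral equals 1
print([f1(n, Fraction(5)) for n in (5, 50, 500)])          # values at x = 5 shrink
print([f2(n, Fraction(1, 10)) for n in (5, 10, 11, 100)])  # eventually 0 at x = 1/10
```

In both cases every integral equals 1 while the values at any fixed point tend to 0, which is the tension that part (c) asks you to explain.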






4.3 Measure Zero 

Suppose that $(\Omega,\mathcal{F},\mu)$ is a measure space. It may be possible to extend the measure $\mu$ to a class of sets larger than $\mathcal{F}$, where the measure of the added new sets is determined by $\mu$ and $\mathcal{F}$. For example, suppose that

(i) $F \in \mathcal{F}$ is such that $\mu F = 0$;

(ii) $A \subseteq F$

then "clearly" $\mu A = 0$ also. However, if $A \notin \mathcal{F}$, then $\mu A$ isn't defined. Yet, $\mu A$ "ought" to be zero. By adding all those sets whose measure "ought" to be zero, we get a new $\sigma$-algebra $\bar{\mathcal{F}}$, called the completion of $\mathcal{F}$ w.r.t. $\mu$.

Definition 4.3.1 Let $(\Omega,\mathcal{F},\mu)$ be a measure space, and let $A \subseteq \Omega$.

(a) We say that $A$ is $\mu$-null if there exists $B \in \mathcal{F}$ such that $A \subseteq B$ and $\mu B = 0$.
(It is not necessary that $A \in \mathcal{F}$.)

(b) The measure space $(\Omega,\mathcal{F},\mu)$ is said to be complete iff every $\mu$-null set is measurable, i.e. belongs to $\mathcal{F}$.

Exercise 4.3.2 Show that a countable union of $\mu$-null sets is $\mu$-null, i.e. that if $N_n$ are $\mu$-null sets, for $n \in \mathbb{N}$, then $\bigcup_n N_n$ is also a $\mu$-null set.

□ 



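Definition and Proposition 4.3.3 Let $(\Omega,\mathcal{F},\mu)$ be a measure space. Let

\[ \mathcal{N} := \{N \subseteq \Omega : \exists F \in \mathcal{F}\,[\mu F = 0 \wedge N \subseteq F]\} \]

be the set of $\mu$-null sets (cf. Definition 4.3.1).

(a) The family of sets

\[ \bar{\mathcal{F}} := \{F \cup N : F \in \mathcal{F}, N \in \mathcal{N}\} \]

is a $\sigma$-algebra, called the completion of $\mathcal{F}$ w.r.t. $\mu$.

(b) We have

\[ G \in \bar{\mathcal{F}} \quad \text{iff there are } F_1, F_2 \in \mathcal{F} \text{ such that } F_1 \subseteq G \subseteq F_2 \text{ and } \mu(F_1) = \mu(F_2) \]

(c) We can extend the measure $\mu$ to a measure $\bar\mu$ on the $\sigma$-algebra $\bar{\mathcal{F}}$ in the obvious way: If $G = F \cup N$, where $F \in \mathcal{F}$, $N \in \mathcal{N}$, define $\bar\mu(G) := \mu(F)$.

(d) The space $(\Omega,\bar{\mathcal{F}},\bar\mu)$ is complete.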



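Proof: (a) We first show that $\bar{\mathcal{F}}$ is a $\sigma$-algebra. That $\bar{\mathcal{F}}$ is closed under countable unions follows straightforwardly from the fact that both $\mathcal{F}$ and $\mathcal{N}$ are closed under countable unions. To check that $\bar{\mathcal{F}}$ is closed under complementation, suppose that $F \cup N \in \bar{\mathcal{F}}$, where $F \in \mathcal{F}$, $N \in \mathcal{N}$. Choose $G \in \mathcal{F}$ such that $\mu G = 0$ and $N \subseteq G$. Then

\[ (F \cup N)^c = (F \cup G)^c \cup [G - (F \cup N)] \]

Now $(F \cup G)^c \in \mathcal{F}$, and $G - (F \cup N) \in \mathcal{N}$ (being a subset of $G$). Hence $(F \cup N)^c \in \bar{\mathcal{F}}$, proving that $\bar{\mathcal{F}}$ is a $\sigma$-algebra. Clearly $\bar{\mathcal{F}} = \sigma(\mathcal{F} \cup \mathcal{N})$.

(b) If $F_1, F_2 \in \mathcal{F}$ are such that $\mu F_1 = \mu F_2$, and if $F_1 \subseteq G \subseteq F_2$, then $G = F_1 \cup (G - F_1)$, where $(G - F_1) \subseteq (F_2 - F_1)$, so that $G - F_1 \in \mathcal{N}$. It follows that $G \in \bar{\mathcal{F}}$.

Next, if $G \in \bar{\mathcal{F}}$, then (by definition of $\bar{\mathcal{F}}$) there are $F_1 \in \mathcal{F}$, $N \in \mathcal{N}$ such that $G = F_1 \cup N$. Also, there is $F \in \mathcal{F}$ such that $\mu F = 0$ and $N \subseteq F$. If we now define $F_2 := F_1 \cup F$, we see that $F_1, F_2 \in \mathcal{F}$, with $F_1 \subseteq G \subseteq F_2$, and $\mu F_1 = \mu F_2$. Thus

\[ \bar{\mathcal{F}} = \{G \subseteq \Omega : \exists F_1, F_2 \in \mathcal{F}\,(F_1 \subseteq G \subseteq F_2 \wedge \mu(F_2 - F_1) = 0)\} \]

(c) We need to verify two things: That the extension $\bar\mu$ of $\mu$ is well-defined on $\bar{\mathcal{F}}$, and that it is a measure. To see that it is well-defined, suppose that $G = F_1 \cup N_1 = F_2 \cup N_2$ are two representations of $G$, where $F_1, F_2 \in \mathcal{F}$, $N_1, N_2 \in \mathcal{N}$. We must show that $\mu F_1 = \mu F_2$. But $F_1 = F_1 \cap (F_1 \cup N_1) = F_1 \cap (F_2 \cup N_2) = (F_1 \cap F_2) \cup (F_1 \cap N_2)$. It follows easily that $\mu F_1 = \mu(F_1 \cap F_2)$. Similarly $\mu F_2 = \mu(F_1 \cap F_2)$, and hence $\mu F_1 = \mu(F_1 \cap F_2) = \mu F_2$.

Next, we show that $\bar\mu$ is a measure on $\bar{\mathcal{F}}$: Suppose that $G_n$ ($n \in \mathbb{N}$) are mutually disjoint members of $\bar{\mathcal{F}}$. Choose $F_n \in \mathcal{F}$, $N_n \in \mathcal{N}$ such that $G_n = F_n \cup N_n$ (for $n \in \mathbb{N}$). Then $\bigcup_n G_n = F \cup N$, where $F := \bigcup_n F_n \in \mathcal{F}$ and $N := \bigcup_n N_n \in \mathcal{N}$ (because a countable union of $\mu$-null sets is $\mu$-null). Then by definition of the extension $\bar\mu$ of $\mu$ on $\bar{\mathcal{F}}$, we have $\bar\mu \bigcup_n G_n = \mu F = \mu \bigcup_n F_n = \sum_n \mu F_n = \sum_n \bar\mu G_n$, where we used the fact that $\mu$ is a measure on $\mathcal{F}$ (and that the $F_n$ are mutually disjoint) to deduce that $\mu \bigcup_n F_n = \sum_n \mu F_n$.

(d) Suppose that $N$ is a null set for $(\Omega,\bar{\mathcal{F}},\bar\mu)$. Then there exists $G \in \bar{\mathcal{F}}$ such that $N \subseteq G$ and such that $\bar\mu(G) = 0$. There exists therefore an $F \in \mathcal{F}$ with $G \subseteq F$ and $\mu F = 0$. Putting all this together, we see that $N \subseteq F$ and $\mu F = 0$, and thus that $N \in \mathcal{N} \subseteq \bar{\mathcal{F}}$. Thus every $\bar\mu$-null set belongs to $\bar{\mathcal{F}}$, as required.

⊣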
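On a finite set the completion can be computed by enumeration, which makes parts (a) and (c) concrete. The sketch below is an illustration under assumptions: the four-point space, the $\sigma$-algebra `F` generated by the partition $\{0\}, \{1,2,3\}$, the measure `mu`, and the helper names `subsets` and `mu_bar` are all hypothetical choices for this example.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset({0, 1, 2, 3})
# Hypothetical sigma-algebra generated by the partition {0}, {1, 2, 3}.
F = {frozenset(), frozenset({0}), frozenset({1, 2, 3}), omega}
# Hypothetical measure: all mass sits on the point 0.
mu = {frozenset(): Fraction(0), frozenset({0}): Fraction(1),
      frozenset({1, 2, 3}): Fraction(0), omega: Fraction(1)}

def subsets(s):
    """All subsets of the finite set s, as frozensets."""
    items = list(s)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

# Null sets: subsets of members of F that have measure zero.
null_sets = set()
for B in F:
    if mu[B] == 0:
        null_sets |= subsets(B)

# Completion: all sets of the form F u N.
completion = {Fset | N for Fset in F for N in null_sets}

def mu_bar(G):
    """Extended measure of G; checks that every representation G = F u N
    gives the same value, as in part (c) of the proposition."""
    values = {mu[Fset] for Fset in F for N in null_sets if Fset | N == G}
    assert len(values) == 1
    return values.pop()

print(len(F), len(completion))
print(mu_bar(frozenset({0, 2})), mu_bar(frozenset({2})))
```

In this example the completion is the full power set of the four-point space: sets such as $\{2\}$ are not in the original $\sigma$-algebra but receive measure 0 after completion.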

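Exercise 4.3.4 Show that if $(S,\mathcal{F},\mu)$ has completion $(S,\bar{\mathcal{F}},\bar\mu)$, then

\[ \bar{\mathcal{F}} = \{A \subseteq S : A \triangle F \text{ is a } \mu\text{-null set, for some } F \in \mathcal{F}\} \]

[Hint: Let $\mathcal{N}$ be the family of null sets, let $\mathcal{G} = \{A \subseteq S : A \triangle F \in \mathcal{N} \text{ for some } F \in \mathcal{F}\}$, and let $\bar{\mathcal{F}} = \sigma(\mathcal{F} \cup \mathcal{N}) = \{F \cup N : F \in \mathcal{F}, N \in \mathcal{N}\}$. First show that $\mathcal{F}, \mathcal{N} \subseteq \mathcal{G}$, and conclude that $\bar{\mathcal{F}} \subseteq \mathcal{G}$. Next, note that if $A \triangle F \in \mathcal{N}$ for some $F \in \mathcal{F}$, then $A = (F - (F - A)) \cup (A - F)$, where $F \in \mathcal{F}$ and $F - A, A - F \in \mathcal{N}$.]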

□ 



Definition 4.3.5 We shall say that a statement $\Phi$ holds $\mu$-almost everywhere (or $\mu$-almost surely if $\mu$ is a probability measure), if the set $\{\omega \in \Omega : \Phi(\omega) \text{ is not true}\}$ where $\Phi$ fails to hold is a $\mu$-null set.






First note that completing a measure space does not create any interesting new measurable 
functions: 



Proposition 4.3.6 Let $(S,\mathcal{F}^\mu,\bar\mu)$ be the completion of $(S,\mathcal{F},\mu)$. Then a function $f : S \to \mathbb{R}$ is $\mathcal{F}^\mu$-measurable iff there is an $\mathcal{F}$-measurable function $g : S \to \mathbb{R}$ such that $f = g$ $\mu$-a.e.



Proof: ($\Rightarrow$): First suppose that $f = I_A$ is an indicator function. By Exercise 4.3.4 we know that $\mathcal{F}^\mu = \{A \subseteq S : \exists F \in \mathcal{F}\,(A \triangle F \text{ is } \mu\text{-null})\}$. So if $I_A$ is $\mathcal{F}^\mu$-measurable, then $I_A = I_F$ $\mu$-a.e. for some $F \in \mathcal{F}$, where then $I_F$ is $\mathcal{F}$-measurable.

It is now straightforward to see that the proposition holds for simple functions as well.

If $f$ is an arbitrary $\mathcal{F}^\mu$-measurable function, we may choose a sequence $f_n$ of simple $\mathcal{F}^\mu$-measurable functions such that $f_n \to f$. Then choose simple $\mathcal{F}$-measurable functions $g_n$ such that $f_n = g_n$ $\mu$-a.e., for all $n \in \mathbb{N}$. Let $g = \limsup_n g_n$. Then $f = g$ $\mu$-a.e. (because $\{s \in S : f(s) \ne g(s)\} \subseteq \bigcup_n \{s \in S : f_n(s) \ne g_n(s)\}$, a countable union of null sets).

($\Leftarrow$): Suppose that $f = g$ $\mu$-a.e. for some $\mathcal{F}$-measurable $g$. If $B$ is a Borel set, then $f^{-1}(B) \triangle g^{-1}(B) \subseteq \{s \in S : f(s) \ne g(s)\}$ is a $\mu$-null set. Since $g^{-1}(B) \in \mathcal{F}$ we see that $f^{-1}(B) \in \mathcal{F}^\mu$ (by Exercise 4.3.4).

⊣

Remarks 4.3.7 If $f$ is $\mathcal{F}$-measurable and if $f = g$ $\mu$-a.e., we cannot generally conclude that $g$ is also $\mathcal{F}$-measurable. That conclusion is valid, however, if $\mathcal{F}$ is complete w.r.t. $\mu$, i.e. if $\mathcal{F} = \mathcal{F}^\mu$.

□ 

Next note that two functions which are equal $\mu$-a.e. have the same integrals.

Lemma 4.3.8 On $(S,\mathcal{F},\mu)$, if $h \ge 0$ is measurable, then $\mu h = 0$ iff $h = 0$ $\mu$-a.e.

Proof: The statement is obviously true if $h$ is simple non-negative. For general $h \in (m\mathcal{F})^+$, choose simple $h_n$ such that $0 \le h_n \uparrow h$. If $\mu h = 0$, then by the MCT, $0 \le \mu h_n \le \mu h = 0$, so that, by the above, $h_n = 0$ $\mu$-a.e. Thus also $h = \lim_n h_n = 0$ $\mu$-a.e. Conversely, if $h = 0$ $\mu$-a.e., then also $h_n = 0$ $\mu$-a.e., and hence $\mu h = \lim_n \mu h_n = 0$, by the MCT.

Theorem 4.3.9 On $(S,\mathcal{F},\mu)$, if $f, g$ are measurable functions such that $f = g$ $\mu$-a.e., and if $f$ is integrable (in the extended sense), then $g$ is integrable (in the extended sense), and $\mu f = \mu g$.

Proof: We have $0 \le |\mu f - \mu g| \le \mu|f - g|$, by Propn. 4.1.9. But $f = g$ $\mu$-a.e. iff $|f - g| = 0$ $\mu$-a.e., so Lemma 4.3.8 shows that $0 \le |\mu f - \mu g| \le 0$.



We can use this to improve the convergence theorems. For example: 






Theorem 4.3.10 (Dominated Convergence Theorem)

Suppose that $f_1, f_2, f_3, \dots$ is a sequence of measurable functions on a complete measure space $(S,\mathcal{F},\mu)$ such that

(i) $\lim_n f_n(s)$ exists for $\mu$-a.e. $s \in S$;

(ii) There is a $g \in \mathcal{L}^1(S,\mathcal{F},\mu)$ such that $|f_n| \le g$ $\mu$-a.e. for all $n \in \mathbb{N}$.

Define $f$ by $f(s) = \lim_n f_n(s)$ if this limit exists, and let $f(s)$ be arbitrary otherwise. Then $f \in \mathcal{L}^1(S,\mathcal{F},\mu)$, and

\[ \mu f = \lim_n \mu f_n \]

Proof: Let

\[ N = \{s \in S : \lim_n f_n(s) \text{ does not exist}\} \cup \bigcup_n \{s \in S : |f_n(s)| > g(s)\} \]

Then $N$ is a null set (being a countable union of null sets), and thus in $\mathcal{F}$ (because the measure space is assumed complete). Define

\[ \tilde f_n = f_n I_{N^c}, \qquad \tilde g = g I_{N^c}, \qquad \tilde f = f I_{N^c} \]

These functions are also $\mathcal{F}$-measurable, and we have $\mu\tilde g = \mu g < \infty$, and

\[ \lim_n \tilde f_n(s) = \tilde f(s), \qquad |\tilde f_n(s)| \le \tilde g(s) \qquad \text{for all } s \in S \]

By Theorem 4.2.4, we can conclude that $\tilde f$ is integrable, and that $\mu\tilde f = \lim_n \mu\tilde f_n$. But $\mu\tilde f = \mu f$ and $\mu\tilde f_n = \mu f_n$, by Theorem 4.3.9.



4.4 Riemann Integral vs. Lebesgue Integral 

We recall briefly how the Riemann integral $\int_a^b f(t)\,dt$ is defined: Let $f$ be a real-valued function defined and bounded on an interval $[a,b]$. A partition $P$ of $[a,b]$ is a finite ordered set $\{a = t_0 < t_1 < t_2 < \dots < t_n = b\}$. The size of such a partition is denoted $\sigma(P)$, and defined by

\[ \sigma(P) := \max_k (t_k - t_{k-1}) \]

A tagged partition is a partition $P$ together with a choice $t_k^* \in [t_{k-1}, t_k]$ for each $k = 1,\dots,n$. Tagged partitions will be indicated by a $*$, i.e. if $P$ is a partition, then $P^*$ denotes an associated tagged partition.

With each tagged partition, we can associate a Riemann sum

\[ S(P^*, f) := \sum_{k=1}^{n} f(t_k^*)\,(t_k - t_{k-1}) = \sum_{k=1}^{n} f(t_k^*)\,\Delta_k t \]

The Riemann integral $\int_a^b f\,dt$ should be the limit of the Riemann sums, over all tagged partitions $P^*$, as $\sigma(P) \to 0$. To be precise, we say

\[ \lim_{\sigma(P) \to 0} S(P^*, f) = L \text{ exists} \]

if and only if for every $\varepsilon > 0$ there is $\delta > 0$ such that

\[ |S(P^*, f) - L| < \varepsilon \quad \text{whenever } \sigma(P) < \delta \]

Then we define

\[ \int_a^b f\,dt := \lim_{\sigma(P) \to 0} S(P^*, f) \]

provided this limit exists, and say that $f$ is Riemann integrable on $[a,b]$.

With each partition $\{a = t_0 < t_1 < \dots < t_n = b\}$ it is possible to associate three natural tagged partitions, namely those having tags equal to the left endpoint, right endpoint and midpoint of each interval. This yields:

• The lefthand Riemann sum $\sum_k f(t_{k-1})\,\Delta_k t$;

• The righthand Riemann sum $\sum_k f(t_k)\,\Delta_k t$;

• The symmetric Riemann sum $\sum_k f\!\left(\frac{t_{k-1} + t_k}{2}\right) \Delta_k t$.

If $f$ is Riemann integrable over $[a,b]$, then each of these sums must converge as $\sigma(P) \to 0$, and all to the same limit.
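The three tagged Riemann sums above are easy to compute for a uniform partition. The following is a minimal sketch, assuming a uniform partition of $[a,b]$ into $n$ pieces and the test function $f(t) = t^2$ on $[0,1]$ (whose integral is $1/3$); the function names `riemann_sum` and `square` are hypothetical.

```python
def riemann_sum(f, a, b, n, tag="left"):
    """Riemann sum of f over the uniform partition of [a, b] into n pieces.

    tag selects the tagged partition: "left", "right" or "mid" (symmetric).
    """
    dt = (b - a) / n
    total = 0.0
    for k in range(1, n + 1):
        t_prev = a + (k - 1) * dt
        t_k = a + k * dt
        if tag == "left":
            t_star = t_prev
        elif tag == "right":
            t_star = t_k
        elif tag == "mid":
            t_star = (t_prev + t_k) / 2
        else:
            raise ValueError("tag must be 'left', 'right' or 'mid'")
        total += f(t_star) * (t_k - t_prev)
    return total

def square(t):
    return t * t

for n in (10, 100, 1000):
    print(n, riemann_sum(square, 0.0, 1.0, n, "left"),
          riemann_sum(square, 0.0, 1.0, n, "mid"),
          riemann_sum(square, 0.0, 1.0, n, "right"))
```

As the partition size $1/n$ shrinks, all three columns approach $1/3$. Because this $f$ is increasing, the lefthand and righthand sums bracket the symmetric one.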

Remarks 4.4.1 A slightly different definition uses Darboux sums rather than Riemann sums. Given a real-valued function $f$ defined and bounded on an interval $[a,b]$, and a partition $P = \{a = t_0 < t_1 < \dots < t_n = b\}$, let the upper and lower Darboux sums be defined by

\[ U(P, f) := \sum_{k=1}^{n} \sup\{f(t) : t \in [t_{k-1}, t_k]\} \cdot (t_k - t_{k-1}) \]

\[ L(P, f) := \sum_{k=1}^{n} \inf\{f(t) : t \in [t_{k-1}, t_k]\} \cdot (t_k - t_{k-1}) \]

If $f$ is continuous on $[a,b]$, it attains its supremum and infimum on each subinterval, i.e. we can choose $t_k^{\max}, t_k^{\min} \in [t_{k-1}, t_k]$ such that

\[ f(t_k^{\max}) = \sup\{f(t) : t \in [t_{k-1}, t_k]\} \qquad f(t_k^{\min}) = \inf\{f(t) : t \in [t_{k-1}, t_k]\} \]

If $P^{*\max}, P^{*\min}$ are the tagged partitions given by $a = t_0 < \dots < t_n = b$ and the tags $t_k^{\max}, t_k^{\min}$ respectively, then it is easy to see that

\[ U(P, f) = S(P^{*\max}, f) \qquad L(P, f) = S(P^{*\min}, f) \]

i.e. the Darboux sums give the most extreme values of the Riemann sums for any given partition. However, the Darboux sums may differ from Riemann sums if $f$ is not continuous.

The Riemann sums may be defined even when $f$ is Banach space-valued, however, whereas the Darboux sums, being dependent on sup's and inf's, make sense for real-valued functions only.

□ 

From calculus, we know that the Riemann integral $\int_a^b f\,dt$ exists when $f$ is continuous (or even piecewise continuous) on $[a,b]$ (cf. also Thm. 4.4.4). When the function is too discontinuous, we run into trouble, however:






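Example 4.4.2 Consider the Dirichlet function

\[ I_{\mathbb{Q}}(t) = \begin{cases} 1 & \text{if } t \in \mathbb{Q} \\ 0 & \text{else} \end{cases} \]

where $\mathbb{Q}$ is the set of rational numbers. If $P = \{a = t_0 < t_1 < \dots < t_n = b\}$ is any partition of $[a,b]$, no matter how fine, we can always find tags $t_k^*, t_k' \in [t_{k-1}, t_k]$ so that $t_k^*$ is rational, and $t_k'$ is irrational. Thus $I_{\mathbb{Q}}(t_k^*) = 1$, $I_{\mathbb{Q}}(t_k') = 0$. It follows that

\[ S(P^*, I_{\mathbb{Q}}) = \sum_k 1 \cdot (t_k - t_{k-1}) = b - a \qquad S(P', I_{\mathbb{Q}}) = \sum_k 0 \cdot (t_k - t_{k-1}) = 0 \]

and thus $S(P^*, I_{\mathbb{Q}})$, $S(P', I_{\mathbb{Q}})$ cannot be made to lie arbitrarily close to each other, no matter how fine the partition $P$. Thus $\lim_{\sigma(P) \to 0} S(P^*, I_{\mathbb{Q}})$ does not exist.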

□ 



When the Riemann integral is first encountered in calculus, it is taught as "the area under a curve": If $f \ge 0$ is continuous, then $\int_a^b f\,dt$ is the area under the curve described by $f$, between $t = a$ and $t = b$. For $A \subseteq \mathbb{R}$, define the indicator function of $A$ by

\[ I_A(t) := \begin{cases} 1 & \text{if } t \in A \\ 0 & \text{else} \end{cases} \]

Consider now the function $I_{\mathbb{Q}}$, where $\mathbb{Q}$ is the set of rational numbers. This function is very discontinuous. If we try to compute the area under this "curve" over the interval $[0,1]$ using the Riemann integral, we run into trouble: The Riemann integral $\int_0^1 I_{\mathbb{Q}}\,dt$ does not exist.

We can make a convincing argument that the area under the curve over the interval $[0,1]$ should be zero, as follows: Use the fact that $\mathbb{Q}$ is countable to enumerate the rational numbers in $[0,1]$, i.e. write $[0,1] \cap \mathbb{Q} = \{q_n : n \in \mathbb{N}\}$. For any $\varepsilon > 0$, define

\[ B_n := \left[q_n - \frac{\varepsilon}{2^{n+1}},\ q_n + \frac{\varepsilon}{2^{n+1}}\right] \text{ for } n \in \mathbb{N}, \qquad f = I_{\bigcup_n B_n} \]

The area under the curve of $f$ is made up of (possibly overlapping) rectangles of height 1 centered at the rational numbers. Thus the area under $f$ over $[0,1]$ is $\le \sum_n 1 \cdot (\text{length of } B_n) = \sum_n \frac{\varepsilon}{2^n} = \varepsilon$. It is also clear that $0 \le I_{\mathbb{Q}} \le f$ on $[0,1]$, and thus that the area under $I_{\mathbb{Q}}$ is at most the area under $f$, i.e. that the area under $I_{\mathbb{Q}}$ is $\le \varepsilon$. Since this is true for any $\varepsilon > 0$, we conclude that the area under $I_{\mathbb{Q}}$ is 0.
Thus we have the following:

\[ \int_0^1 I_{\mathbb{Q}}\,dt \text{ is undefined, but it should be zero} \]
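The covering argument rests on the geometric series $\sum_n \varepsilon / 2^n = \varepsilon$. The short sketch below tabulates the total length of the first $N$ covering intervals in exact arithmetic; it assumes the enumeration starts at $n = 1$, and the function name `cover_length` is hypothetical.

```python
from fractions import Fraction

def cover_length(eps, N):
    """Total length of B_1, ..., B_N, where B_n has half-width eps / 2^(n+1).

    Overlaps are counted twice, so this is an upper bound for the length of
    the union of the intervals.
    """
    return sum((2 * eps / 2 ** (n + 1) for n in range(1, N + 1)), Fraction(0))

eps = Fraction(1, 100)
for N in (1, 5, 10, 20):
    print(N, cover_length(eps, N), float(cover_length(eps, N)))
```

However many intervals are included, the total length stays strictly below $\varepsilon$, even though the intervals cover every rational in $[0,1]$ in the limit.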

The Riemann integral is simply not powerful enough to handle functions like $I_{\mathbb{Q}}$.

You may counter that a function such as $I_{\mathbb{Q}}$ is pathological, and unlikely to be encountered in practice. It is true that we chose it here simply to make a point. However, the following example should cause you to feel uneasy about the assertion that $I_{\mathbb{Q}}$ is "pathological":






Example 4.4.3 Consider the function $g(t)$ defined by

\[ g(t) = \lim_{m \to \infty} \lim_{n \to \infty} \cos(m!\,\pi t)^{2n} \]

If $t$ is a rational number, i.e. $t = \frac{p}{q}$ where $p \in \mathbb{Z}$, $q \in \mathbb{N}$, then $m!\,t \in \mathbb{Z}$ for all $m \ge q$. Consequently, $\cos(m!\,\pi t) = \pm 1$ for all $m \ge q$, and thus $\cos(m!\,\pi t)^{2n} = 1$ for all $n$ and all $m \ge q$. It follows that $g(t) = 1$ when $t$ is rational.

On the other hand, if $t$ is irrational, then $0 \le |\cos(m!\,\pi t)| < 1$ for all $m$, and so $0 \le \cos(m!\,\pi t)^{2n} < 1$ for all $m$. Now if $0 \le x < 1$, then $x^n \to 0$. It follows that $\lim_{n \to \infty} \cos(m!\,\pi t)^{2n} = 0$ for all $m$, and thus that $g(t) = 0$ when $t$ is irrational. Hence

\[ I_{\mathbb{Q}}(t) = \lim_{m \to \infty} \lim_{n \to \infty} \cos(m!\,\pi t)^{2n} \]

The "pathological" function $I_{\mathbb{Q}}$ therefore appears as a limit of perfectly ordinary functions.

□ 
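The two cases of Example 4.4.3 can be probed numerically, with some care. The sketch below is only an illustration: the rational case is checked exactly with `Fraction`, while the irrational case uses the floating-point value of $\sqrt{2}$, which is itself only an approximation, so the computation is meaningful only for small $m$. The function names `factorial_multiple_is_integer` and `inner_power` are hypothetical.

```python
import math
from fractions import Fraction

def factorial_multiple_is_integer(t, m):
    """For a rational t (given as a Fraction), check whether m! * t is an integer."""
    return (math.factorial(m) * t).denominator == 1

def inner_power(t, m, n):
    """cos(m! * pi * t)^(2n) in floating point; only trustworthy for small m."""
    return math.cos(math.factorial(m) * math.pi * t) ** (2 * n)

t_rational = Fraction(3, 7)
print([factorial_multiple_is_integer(t_rational, m) for m in range(5, 10)])
print(inner_power(t_rational, 7, 1000))  # close to 1 once m >= q = 7

t_irrational = math.sqrt(2)              # floating-point stand-in for sqrt(2)
print([inner_power(t_irrational, m, 1000) for m in (1, 2, 3)])  # close to 0
```

For $t = 3/7$ the product $m!\,t$ becomes an integer from $m = 7$ onwards, so the inner power is 1 up to rounding; for the stand-in for $\sqrt{2}$ the inner power is already negligible at $n = 1000$.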

Unlike the Lebesgue integral, the Riemann integral does not handle limits well. We now show that when the Riemann integral exists, then the Lebesgue integral (w.r.t. $\lambda$) does too, and the integrals' values coincide:

Theorem 4.4.4 Let $f$ be a bounded real-valued function on the compact interval $[a,b]$. Then

(a) $f$ is Riemann integrable iff $f$ is continuous $\lambda$-a.e.

(b) If $f$ is Riemann integrable, then $f$ is Lebesgue integrable, and the integrals are equal:

\[ \int_a^b f\,dt = \int_{[a,b]} f\,d\lambda \]



Proof: Assume that $f$ is Riemann integrable. Then we can choose a sequence $P_n$ of successively finer partitions of $[a, b]$ such that $U(f, P_n) - L(f, P_n) < \frac{1}{n}$. Define functions $g_n, h_n$ on $[a, b]$ as follows: For each $n$, $g_n(a) = h_n(a) = f(a)$. If $P_n = \{a = t^n_0 < t^n_1 < t^n_2 < \dots < t^n_{m_n} = b\}$, then $g_n, h_n$ are step functions, with steps determined by $P_n$, defined as follows: If $t \in (a, b]$, then $t \in (t^n_{k-1}, t^n_k]$ for some $k$, and we define

$$g_n(t) = \inf\{f(x) : t^n_{k-1} \le x \le t^n_k\} \qquad h_n(t) = \sup\{f(x) : t^n_{k-1} \le x \le t^n_k\}$$

Then $g_n, h_n$ are clearly simple Borel functions, designed so that

$$\int_{[a,b]} g_n\, d\lambda = L(f, P_n) \qquad \int_{[a,b]} h_n\, d\lambda = U(f, P_n)$$

Moreover $(g_n)_n$ is a bounded increasing sequence, with $g_n \le f$, and $(h_n)_n$ is a bounded decreasing sequence, with $h_n \ge f$. Define $g = \lim_n g_n$, $h = \lim_n h_n$. Then $g, h$ are Borel functions, and by the DCT we have $\int_{[a,b]} g\, d\lambda = \lim_n L(f, P_n) = \int_a^b f\, dt$ and $\int_{[a,b]} h\, d\lambda = \lim_n U(f, P_n) = \int_a^b f\, dt$. Hence $\int_{[a,b]} (h - g)\, d\lambda = 0$.

Now since $h \ge g$, Lemma 4.3.8 implies that $h = g$ $\lambda$-a.e. on $[a, b]$. Since $g \le f \le h$, we must have $g = f = h$ $\lambda$-a.e., and thus $\int_{[a,b]} f\, d\lambda = \int_{[a,b]} g\, d\lambda = \lim_n L(f, P_n) = \int_a^b f\, dt$. This proves (b).






Next, note that if $t \notin \bigcup_n P_n$, and if $h(t) = g(t)$, then $f$ is necessarily continuous at $t$: For then $g(t) = f(t) = h(t)$, i.e.

$$\lim_n \inf\{f(x) : t^n_{k_n - 1} \le x \le t^n_{k_n}\} = f(t) = \lim_n \sup\{f(x) : t^n_{k_n - 1} \le x \le t^n_{k_n}\}$$

(where $k_n$ is such that $t \in (t^n_{k_n - 1}, t^n_{k_n})$) and thus all values of $f(x)$ must lie close to $f(t)$ if $x$ is close to $t$. Hence any discontinuity of $f$ must belong to $\bigcup_n P_n \cup \{t : g(t) \ne h(t)\}$, a set of $\lambda$-measure zero. This shows that if $f$ is Riemann integrable, then $f$ is continuous $\lambda$-a.e., proving one direction of (a).

Conversely, suppose that $f$ is continuous $\lambda$-a.e. Let $P_n$ be a partition of $[a, b]$ that divides it into $2^n$ subintervals of equal length, and construct simple Borel functions $g_n, h_n$ as above. If $f$ is continuous at $t$, then obviously $\lim_n g_n(t) = f(t) = \lim_n h_n(t)$. Hence $\lim_n (h_n - g_n) = 0$ $\lambda$-a.e. By the DCT, we see that $0 = \lim_n \int (h_n - g_n)\, d\lambda = \lim_n (U(f, P_n) - L(f, P_n))$, from which Riemann integrability easily follows.

⊣
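The dyadic partitions used in the converse direction are easy to experiment with. The sketch below assumes an increasing integrand (here $f(x) = x^2$ on $[0, 1]$), so that the infimum and supremum on each subinterval sit at its endpoints; `darboux_sums` is an illustrative helper, not notation from the notes.

```python
from fractions import Fraction

def darboux_sums(f, a, b, n):
    """Lower and upper sums of an increasing function f over 2**n equal subintervals of [a, b]."""
    pieces = 2 ** n
    width = Fraction(b - a, pieces)
    lower = upper = Fraction(0)
    for k in range(pieces):
        left = a + k * width
        right = left + width
        # f is increasing, so the infimum sits at the left end and the supremum at the right end.
        lower += f(left) * width
        upper += f(right) * width
    return lower, upper

square = lambda x: x * x
for n in (1, 4, 8):
    low, up = darboux_sums(square, 0, 1, n)
    print(n, float(low), float(up), up - low)
```

For this integrand the gap $U(f, P_n) - L(f, P_n)$ equals $2^{-n}$ exactly, and both sums squeeze the common value $\frac{1}{3}$ of the Riemann and Lebesgue integrals.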

Remarks 4.4.5 An oft-used fact in calculus is that $\frac{d}{dt} \int_a^b f(x, t)\, dx = \int_a^b \frac{\partial f}{\partial t}(x, t)\, dx$, provided that $\frac{\partial f}{\partial t}$ is bounded. This is known as differentiation under the integral sign. Since differentiation involves the taking of limits, the interchange can be justified via the DCT.

Let $G(t) := \int_a^b f(x, t)\, \mu(dx)$. We want to show that $G'(t_0) = \int_a^b \frac{\partial f}{\partial t}(x, t_0)\, \mu(dx)$, under certain commonly satisfied conditions: Suppose that there exists a $\mu$-integrable function $M(x)$ such that $\left|\frac{\partial f}{\partial t}(x, t)\right| \le M(x)$ for all $x$, and all $t \in (t_0 - \delta, t_0 + \delta)$ (where $\delta > 0$). Let $(h_n)_n$ be a sequence of non-zero reals such that $h_n \to 0$, and such that each $|h_n| < \delta$. Then

$$G'(t_0) = \lim_n \frac{G(t_0 + h_n) - G(t_0)}{h_n} = \lim_n \int_a^b \frac{f(x, t_0 + h_n) - f(x, t_0)}{h_n}\, \mu(dx) = \lim_n \int_a^b g_n(x)\, \mu(dx)$$

where $g_n(x) := \frac{f(x, t_0 + h_n) - f(x, t_0)}{h_n}$. Note that $g_n(x) \to \frac{\partial f}{\partial t}(x, t_0)$. We claim that the sequence $g_n$ is dominated by $M$. Indeed, by the Mean Value Theorem, there is, for each $x$ and each $n \in \mathbb{N}$, a $t^n \in (t_0 - |h_n|, t_0 + |h_n|) \subseteq (t_0 - \delta, t_0 + \delta)$ such that $g_n(x) = \frac{\partial f}{\partial t}(x, t^n)$, and thus $|g_n(x)| \le M(x)$. Since $M$ is $\mu$-integrable, $\lim_n \int g_n(x)\, \mu(dx) = \int \lim_n g_n(x)\, \mu(dx)$, and we are done.

□ 
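A quick numerical sanity check of differentiation under the integral sign is sketched below. The integrand $f(x, t) = \sin(tx)$ on $[0, 1]$, the point $t_0 = 1.3$ and the midpoint rule are all choices made for this illustration; here $\left|\frac{\partial f}{\partial t}\right| = |x \cos(tx)| \le 1$, so $M(x) = 1$ serves as the dominating function.

```python
from math import cos, sin

def midpoint_integral(func, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def G(t):
    """G(t) = integral over [0, 1] of f(x, t) dx with f(x, t) = sin(t * x)."""
    return midpoint_integral(lambda x: sin(t * x), 0.0, 1.0)

def integral_of_partial(t):
    """Integral over [0, 1] of (d/dt) sin(t * x) = x * cos(t * x)."""
    return midpoint_integral(lambda x: x * cos(t * x), 0.0, 1.0)

t0 = 1.3
for h in (1e-1, 1e-2, 1e-3):
    quotient = (G(t0 + h) - G(t0)) / h      # difference quotient of G at t0
    print(h, quotient, integral_of_partial(t0))
```

As `h` shrinks, the difference quotients approach the integral of the partial derivative, in line with the DCT argument.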



4.5 Chain Rule, Change of Variables 

Here is another way of obtaining new measures from old: 

Definition and Proposition 4.5.1 Suppose that $f : (S, \mathcal{F}, \mu) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is a non-negative measurable function. Define a set mapping $f \cdot \mu : \mathcal{F} \to [0, \infty]$ by

$$(f \cdot \mu)A := \int_A f\, d\mu = \mu(f I_A)$$

Then $\nu = f \cdot \mu$ is a measure on $(S, \mathcal{F})$.

$f$ is called the $\mu$-density of $\nu$, and also written as the Radon-Nikodym derivative $f = \frac{d\nu}{d\mu}$.






Proof: We need only check that $f \cdot \mu$ is countably additive. Suppose that $A = \bigcup_n A_n$ is a union of a family of mutually disjoint members of $\mathcal{F}$. Put $f_n = \sum_{k \le n} f I_{A_k}$. Then $f_n \uparrow f I_A$, and so $\mu f_n \uparrow \mu(f I_A) = (f \cdot \mu)A$ by the MCT. But $\mu f_n = \sum_{k \le n} \mu(f I_{A_k}) = \sum_{k \le n} (f \cdot \mu)A_k$, and thus $\mu f_n \uparrow \sum_k (f \cdot \mu)A_k$ (as $n \to \infty$). We conclude that $(f \cdot \mu)A = \sum_k (f \cdot \mu)A_k$.



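The following proposition explains the notation $\frac{d\nu}{d\mu}$:

Proposition 4.5.2 (Chain Rule)
If $f : (S, \mathcal{F}, \mu) \to [0, \infty]$ and $g : (S, \mathcal{F}) \to \overline{\mathbb{R}}$ are measurable, then

$$\mu(fg) = (f \cdot \mu)g$$

i.e. if $\nu = f \cdot \mu$ (so that $f = \frac{d\nu}{d\mu}$), then

$$\int fg\, d\mu = \int g\, \frac{d\nu}{d\mu}\, d\mu = \int g\, d\nu$$

whenever one of these sides exists (in which case the other side exists as well, and the two sides are equal).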



Proof: If $g = I_A$ is an indicator function, then $\mu(f I_A) = (f \cdot \mu)I_A$, by definition of $f \cdot \mu$. If $g = \sum_{k \le n} a_k I_{A_k}$ is simple, then $\mu\left(f \sum_{k \le n} a_k I_{A_k}\right) = \sum_{k \le n} a_k\, \mu(f I_{A_k}) = \sum_{k \le n} a_k (f \cdot \mu)I_{A_k} = (f \cdot \mu)\left(\sum_{k \le n} a_k I_{A_k}\right)$, by linearity of the integral. So the result holds for simple $g$.

If $g$ is a non-negative measurable function, we may choose simple $g_n \uparrow g$. Then by the MCT, $\mu(fg) = \lim_n \mu(f g_n) = \lim_n (f \cdot \mu)g_n = (f \cdot \mu)g$.

Finally, if $g$ is an arbitrary measurable function, then $\mu|fg| = \mu(f|g|) = (f \cdot \mu)|g|$, since $f, |g|$ are non-negative. Hence $\mu(fg)$ exists iff $(f \cdot \mu)g$ exists (by Propn. 4.1.9). Now split $g$ into its positive and negative parts to see that $\mu(fg) = \mu(fg^+) - \mu(fg^-) = (f \cdot \mu)g^+ - (f \cdot \mu)g^- = (f \cdot \mu)g$.
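On a finite set every measure is determined by its point masses, which makes the Chain Rule easy to check by hand. The following sketch uses a four-point space chosen purely for illustration; the dictionaries `mu`, `f`, `g` and the helper `integral` are assumptions of this example.

```python
from fractions import Fraction

# A finite measure space: S = {0, 1, 2, 3} with point masses mu({s}).
mu = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
f = {0: Fraction(0), 1: Fraction(2), 2: Fraction(4), 3: Fraction(0)}   # non-negative density
g = {0: Fraction(5), 1: Fraction(-1), 2: Fraction(3), 3: Fraction(7)}  # arbitrary function

def integral(h, measure):
    """Integral of h with respect to a measure given by its point masses."""
    return sum(h[s] * measure[s] for s in measure)

# nu = f . mu, i.e. nu({s}) = f(s) * mu({s})
nu = {s: f[s] * mu[s] for s in mu}

fg = {s: f[s] * g[s] for s in mu}
lhs = integral(fg, mu)     # integral of f*g with respect to mu
rhs = integral(g, nu)      # integral of g with respect to nu = f . mu
print(lhs, rhs)
```

Both sides evaluate to the same number, since in the finite case the Chain Rule reduces to regrouping the terms $g(s)\, f(s)\, \mu\{s\}$.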



Remarks 4.5.3 The above proof illustrates a useful technique, which David Williams calls the standard machine (cf. his excellent, and short, book Probability with Martingales). To prove that something holds for all integrals of a certain type:

• First show that it holds for indicator functions;

• Use linearity to show that it holds for simple non-negative functions;

• Then use the MCT to lift the result to non-negative measurable functions;

• And finally split an arbitrary measurable $f$ into its positive and negative parts, and use linearity once again.

□






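Recall that if $f : (S, \mathcal{F}, \mu) \to (T, \mathcal{T})$ is measurable, then the map

$$\mu \circ f^{-1} : \mathcal{T} \to [0, \infty] : B \mapsto \mu(f^{-1}[B])$$

defines a measure on $(T, \mathcal{T})$. Also if $g : (T, \mathcal{T}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable, then so is $g \circ f : (S, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The next propn. shows that the integrals $\int g \circ f\, d\mu$ and $\int g\, d(\mu \circ f^{-1})$ are equal:

Proposition 4.5.4 (Change of Variables)
Given a measure space $(S, \mathcal{F}, \mu)$, a measurable space $(T, \mathcal{T})$ and two measurable maps $f : S \to T$ and $g : T \to \mathbb{R}$, then

$$\mu(g \circ f) = (\mu \circ f^{-1})(g) \qquad \text{i.e.} \qquad \int_S g \circ f\, d\mu = \int_T g\, d(\mu \circ f^{-1})$$

whenever one of these sides exists (in which case the other side exists as well, and the two sides are equal).

Exercise 4.5.5 Prove Propn. 4.5.4.

[Hint: Use the standard machine. For arbitrary measurable $g$, observe that $\mu|g \circ f| = \mu(|g| \circ f) = (\mu \circ f^{-1})|g|$, because $|g|$ is non-negative. This proves that $g \circ f$ is $\mu$-integrable iff $g$ is $\mu \circ f^{-1}$-integrable (i.e. one side exists iff the other exists).]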

□ 
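The Change of Variables formula can likewise be seen on a finite example, where the image measure $\mu \circ f^{-1}$ is obtained by collecting point masses. The five-point space and the maps `f`, `g` below are illustrative assumptions, as is the helper name `pushforward`.

```python
from collections import defaultdict
from fractions import Fraction

# mu: point masses on S = {-2, -1, 0, 1, 2}; f maps S into T = {0, 1, 4}.
mu = {s: Fraction(1, 5) for s in (-2, -1, 0, 1, 2)}
f = lambda s: s * s
g = lambda t: 3 * t + 1

def pushforward(measure, mapping):
    """Image measure: (mu o f^{-1})({t}) = mu({s : f(s) = t})."""
    image = defaultdict(Fraction)
    for s, mass in measure.items():
        image[mapping(s)] += mass
    return dict(image)

mu_f = pushforward(mu, f)
lhs = sum(g(f(s)) * mass for s, mass in mu.items())     # integral of g o f  d(mu)
rhs = sum(g(t) * mass for t, mass in mu_f.items())      # integral of g  d(mu o f^{-1})
print(mu_f, lhs, rhs)
```

The left-hand side sums over the five points of $S$, the right-hand side over the three points of $T$; the two agree because each point of $T$ carries the total mass of its preimage.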

4.6 Definition of Expectation 

We begin with two examples that lead up to the definition of the expected value of a random 
variable. 

Example 4.6.1 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $X$ be a discrete random variable, i.e. a random variable with at most countably many values. In that case $X$ can be written

$$X = \sum_{k=1}^{\infty} x_k I_{A_k}$$

where the $x_k$ are the values that $X$ can take, and $A_k = \{\omega \in \Omega : X(\omega) = x_k\}$. We are interested in the value of $\int X\, d\mathbb{P}$, assuming that it exists. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that

$$\int X\, d\mathbb{P} = \sum_{k=1}^{\infty} x_k \mathbb{P}(A_k) = \sum_{k=1}^{\infty} x_k \mathbb{P}(X = x_k)$$

(because $X_n = \sum_{k=1}^{n} x_k I_{A_k}$ is dominated by $|X| = \sum_{k=1}^{\infty} |x_k| I_{A_k}$, assuming that the $A_k$ are mutually disjoint.) But this sum is just the definition of the expected value of a discrete random variable, i.e.

$$\mathbb{E}X = \int X\, d\mathbb{P}$$

if $X$ is a discrete random variable.






□ 

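Example 4.6.2 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $X$ be a continuous random variable, i.e. a random variable that has a probability density function $f_X$ such that

$$\mathbb{P}(X \le x) = \int_{-\infty}^{x} f_X(t)\, dt = \int_{(-\infty, x]} f_X\, d\lambda$$

(This is the definition of a continuous random variable.) Now let $\mathbb{P}_X := \mathbb{P} \circ X^{-1}$ be the law of $X$. Recall that $\mathbb{P}_X$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with the property that

$$\mathbb{P}_X(B) = \mathbb{P}(X \in B) \qquad B \in \mathcal{B}(\mathbb{R})$$

If we define $\nu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ so that

$$\frac{d\nu_X}{d\lambda} = f_X \qquad \text{i.e.} \qquad \nu_X(B) = \int_B f_X\, d\lambda$$

then we know that $\nu_X$ is a probability measure. Moreover, since

$$\nu_X(-\infty, a] = \int_{(-\infty, a]} f_X(x)\, \lambda(dx) = \mathbb{P}(X \le a) = \mathbb{P}_X(-\infty, a] \qquad \text{all } a \in \mathbb{R}$$

it follows that $\nu_X = \mathbb{P}_X$. (This is because $\mathbb{P}_X, \nu_X$ agree on the $\pi$-system of intervals of the form $(-\infty, a]$ which generates $\mathcal{B}(\mathbb{R})$.) In particular, $\frac{d\mathbb{P}_X}{d\lambda} = f_X$, i.e. the density of a continuous random variable $X$ is precisely the Radon-Nikodym derivative of the law of $X$ w.r.t. Lebesgue measure!

Now let $g : \mathbb{R} \to \mathbb{R}$ be a Borel function and consider the integral $\int g(X)\, d\mathbb{P}$, assuming that it exists. We use the Change of Variable Formula to obtain

$$\int g(X)\, d\mathbb{P} = \int g \circ X(\omega)\, \mathbb{P}(d\omega) = \int g(x)\, \mathbb{P} \circ X^{-1}(dx) = \int g(x)\, \mathbb{P}_X(dx)$$

However, by the Chain Rule

$$\int g(x)\, d\mathbb{P}_X = \int g(x)\, \frac{d\mathbb{P}_X}{d\lambda}\, d\lambda = \int g(x) f_X(x)\, \lambda(dx)$$

and hence

$$\int g(X)\, d\mathbb{P} = \int g(x) f_X(x)\, d\lambda$$

But the integral on the right is just the definition of the expected value of a continuous random variable, i.e.

$$\mathbb{E}g(X) = \int g(X)\, d\mathbb{P}$$

In particular, with $g = \mathrm{id}_{\mathbb{R}}$, we have $\mathbb{E}X = \int X\, d\mathbb{P}$.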






□ 

We now wipe the slate clean and redefine the expectation of a random variable as follows:

Definition 4.6.3 (Expectation)
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. If $X \ge 0$ or $X \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$ then the expected value of $X$ is defined to be

$$\mathbb{E}X := \int X\, d\mathbb{P}$$



The two examples show that this definition will give the same results as the two earlier 
definitions of expectation for discrete and continuous random variables respectively. Moreover, 
we now also have a definition of the expected value of a random variable which is neither 
discrete nor continuous. 

Exercise 4.6.4 Consider the following random experiment: Independently, choose two random numbers $X, Y$ uniformly from $[0, 1]$. Let $Z := Y I_{\{X < \frac{1}{2}\}}$, i.e.

$$Z = \begin{cases} Y & \text{if } X < \frac{1}{2} \\ 0 & \text{if } X \ge \frac{1}{2} \end{cases}$$

(a) Explain why $\mathbb{E}Z = \frac{1}{4}$.

(b) Show that $Z$ is neither a discrete random variable, nor a continuous random variable. (Thus you did not, until now, have a definition of $\mathbb{E}Z$, even though you were able to calculate it!)

(c) To calculate $\mathbb{E}Z$ from the measure-theoretic definition, we need to find a probability space that models this experiment. Let $\Omega := [0, 1] \times [0, 1]$ be the unit square, with probability measure $\mathbb{P}$ the two-dimensional Lebesgue measure $\lambda(dx, dy)$ (i.e. $\mathbb{P}(B) = \lambda(B) =$ area of $B$). Suitable definitions of $X, Y$ are given as follows: If $\omega = (x, y) \in \Omega$, define

$$X(\omega) := x \qquad Y(\omega) := y$$

(d) Show that $\mathbb{P}(X \le a) = \text{Area}([0, a] \times [0, 1]) = a$. Deduce that $X$ is uniformly distributed with values in $[0, 1]$. (Similarly $\mathbb{P}(Y \le b) = b$.)

(e) Show that $X, Y$ are independent by showing that $\mathbb{P}(X \le a, Y \le b) = \mathbb{P}(X \le a)\mathbb{P}(Y \le b)$.

(f) Thus $\Omega, \mathbb{P}, X, Y$ do indeed model the random experiment described above. Now show that $\int Z\, d\mathbb{P} = \int_0^1 \int_0^{1/2} y\, dx\, dy = \frac{1}{4}$.

The point of this exercise is to show that the new definition of expectation also works in cases where the random variable under consideration is neither discrete nor continuous.

[You say: "But in statistics we would have handled this problem using the joint density of $(X, Y)$:"

$$\mathbb{E}Z = \int_0^1 \int_0^1 y\, I_{\{x < \frac{1}{2}\}}\, f_{X,Y}(x, y)\, dx\, dy$$

and that is true, but you would then have had to have yet another definition for the expectation of a function $Z = g(X, Y)$ of two random variables that possess a joint density.]






□ 
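The random experiment of Exercise 4.6.4 is easy to simulate, which gives an informal check of the value $\mathbb{E}Z = \frac{1}{4}$ and of the fact that $Z$ has an atom at 0. This is only a Monte Carlo sketch; the sample size, the seed and the function name `simulate_Z` are choices made for this illustration.

```python
import random

def simulate_Z(samples, seed=0):
    """Draw independent uniform X, Y on [0, 1] and return the values Z = Y * 1{X < 1/2}."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        values.append(y if x < 0.5 else 0.0)
    return values

zs = simulate_Z(200_000)
mean_Z = sum(zs) / len(zs)                           # Monte Carlo estimate of EZ
atom_at_zero = sum(z == 0.0 for z in zs) / len(zs)   # estimate of P(Z = 0)
print(mean_Z, atom_at_zero)
```

The estimate of $\mathbb{P}(Z = 0)$ close to $\frac{1}{2}$ suggests why $Z$ cannot have a density, while the remaining mass is spread continuously over $(0, 1]$, so $Z$ is not discrete either.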



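Some notation: If $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and $F \in \mathcal{F}$, we define the integral $\mathbb{E}[X; F]$ of $X$ over $F$ by

$$\mathbb{E}[X; F] := \int_F X\, d\mathbb{P} = \mathbb{E}[X I_F]$$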

Now that we've defined the expectation of a random variable as its integral with respect 
to the probability measure, several facts about expectation are immediately obvious: 

Proposition 4.6.5 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

(a) $\mathbb{E}c = c$ for any constant random variable $c$.

(b) $\mathbb{E}(aX + bY) = a\mathbb{E}X + b\mathbb{E}Y$ for any random variables $X, Y$ (whose expectations exist) and any $a, b \in \mathbb{R}$.

(c) If $X \ge Y$ are random variables, then $\mathbb{E}X \ge \mathbb{E}Y$.

(d) If $X, Y$ are random variables and $X = Y$ almost surely, then $\mathbb{E}X = \mathbb{E}Y$ (where one expectation exists if and only if the other exists).

Limit theorems about integration translate directly into limit theorems about expectation. 

Theorem 4.6.6 (Convergence) 

Suppose that $X$ is a random variable and that $X_n$ is a sequence of random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with the property that

$$X_n \to X \text{ a.s.}$$

Then

(a) If $X_n \ge 0$ and $X_n \uparrow X$, then $\mathbb{E}X_n \uparrow \mathbb{E}X \le +\infty$.

(b) If $X_n \ge 0$, then $\mathbb{E}X \le \liminf \mathbb{E}X_n$.

(c) If there is a non-negative random variable $Y$ with finite expectation such that $|X_n| \le Y$ for all $n$, then $\mathbb{E}|X_n - X| \to 0$, so that $\mathbb{E}X_n \to \mathbb{E}X$.

(d) If for some finite constant $K$ we have $|X_n| \le K$ for all $n$, then $\mathbb{E}|X_n - X| \to 0$, so that $\mathbb{E}X_n \to \mathbb{E}X$.

Proof: (a) is the Monotone Convergence Theorem, (b) is Fatou's Lemma, (c) is the Lebesgue Dominated Convergence Theorem, and (d) is a special case of (c).



4.7 Inequalities 



Next we prove some important inequalities. For a random variable $X$, we define its variance $\mathrm{Var}(X)$ by

$$\mathrm{Var}(X) := \mathbb{E}[(X - \mathbb{E}[X])^2]$$






Proposition 4.7.1 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

• (Markov's Inequality) If $X$ is a random variable, and if $g$ is a function which is increasing and non-negative, and whose domain includes the range of $X$ (so that the composition $g(X)$ is defined), then

$$\mathbb{E}g(X) \ge g(c)\mathbb{P}(X \ge c)$$

• (Chebyshev's Inequality) For $\varepsilon > 0$, we have

$$\mathbb{P}(|X - \mathbb{E}[X]| \ge \varepsilon) \le \varepsilon^{-2}\, \mathrm{Var}(X)$$

Proof: Let $F = \{\omega \in \Omega : X(\omega) \ge c\} = \{X \ge c\}$. Markov's inequality follows directly from the fact that

$$\int g(X)\, d\mathbb{P} \ge \int_F g(X)\, d\mathbb{P} \ge \int_F g(c)\, d\mathbb{P}$$

Chebyshev's Inequality is a direct consequence of Markov's; cf. the next exercise.

⊣

Exercise 4.7.2 (a) Show that if $X \ge 0$ and $c > 0$, then $\mathbb{P}(X \ge c) \le \frac{1}{c}\mathbb{E}X$.
(b) Prove Chebyshev's Inequality.

□ 
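Both inequalities can be verified exactly for a simple discrete law. The sketch below uses the roll of a fair die; the thresholds `c = 5` and `eps = 2` and the helpers `expect` and `prob` are illustrative choices.

```python
from fractions import Fraction

# Law of a fair die: P(X = k) = 1/6 for k = 1, ..., 6.
law = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(h):
    """E h(X) for the discrete law above."""
    return sum(h(x) * p for x, p in law.items())

def prob(event):
    """P(X satisfies event)."""
    return sum(p for x, p in law.items() if event(x))

mean = expect(lambda x: x)
variance = expect(lambda x: (x - mean) ** 2)

c = 5
markov_ok = prob(lambda x: x >= c) <= mean / c                # P(X >= c) <= EX / c
eps = 2
chebyshev_ok = prob(lambda x: abs(x - mean) >= eps) <= variance / eps ** 2
print(mean, variance, markov_ok, chebyshev_ok)
```

In this example both bounds hold with room to spare, which is typical: Markov's and Chebyshev's inequalities are crude but require almost no information about the law of $X$.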



Definition 4.7.3 Let $U$ be an open subinterval of $\mathbb{R}$. A function $g : U \to \mathbb{R}$ is said to be convex if and only if for any $x, y \in U$ and any $\lambda \in [0, 1]$ we have

$$g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y)$$



Remarks 4.7.4 The following remarks feature in the proof of Jensen's inequality, the next 
proposition, and should be digested thoroughly. 

(a) Recall that if $a, b$ are points in $\mathbb{R}^n$, then $\{\lambda a + (1 - \lambda)b : 0 \le \lambda \le 1\}$ is simply the line segment in $\mathbb{R}^n$ joining $b$ to $a$. The point with coordinates $(\lambda x + (1 - \lambda)y,\; g(\lambda x + (1 - \lambda)y))$ is simply a point on the graph of $g$ between $x$ and $y$. On the other hand, the point $(\lambda x + (1 - \lambda)y,\; \lambda g(x) + (1 - \lambda)g(y))$ is a point on the line segment joining $(y, g(y))$ to $(x, g(x))$. These two points have the same first coordinate, namely $\lambda x + (1 - \lambda)y$. We can now interpret convexity geometrically: A function $g$ is convex if and only if its graph lies below any chord joining two points on the graph of $g$.

(b) This means that $g$ is concave up (i.e. $g'' \ge 0$ if $g''$ exists).

(c) Let $g : U \to \mathbb{R}$, where $U$ is an open subinterval of $\mathbb{R}$. For $u, v \in U$, define $\Delta(u, v) = \frac{g(v) - g(u)}{v - u}$. Geometrically, $\Delta(u, v)$ is the slope of the chord joining $u$ to $v$ on the graph of $g$. Then $g$ is convex if and only if $u < v < w$ in $U$ implies $\Delta(u, v) \le \Delta(v, w)$. This is easy to see geometrically. A more rigorous proof: If $u < v < w$, define $\lambda = \frac{v - w}{u - w}$. Then $v = \lambda u + (1 - \lambda)w$. It follows that

$$g(v) \le \lambda g(u) + (1 - \lambda)g(w) = \frac{v - w}{u - w}\, g(u) + \frac{u - v}{u - w}\, g(w)$$






Hence

$$(u - v)g(v) + (v - w)g(v) \ge (v - w)g(u) + (u - v)g(w)$$

Rearranging yields the result.

(d) Now if $v < w$ in $U$ and we let $u \uparrow v$, then

(i) $\Delta(u, v)$ increases as $u \uparrow v$.

(ii) $\Delta(u, v) \le \Delta(v, w)$, and thus $\Delta(u, v)$ is bounded from above as $u \uparrow v$.

Since a sequence which is increasing and bounded from above must converge, $D^-(v) = \lim_{u \uparrow v} \Delta(u, v)$ exists. Similar reasoning shows that $D^+(v) = \lim_{w \downarrow v} \Delta(v, w)$ must exist for every $v \in U$. Thus left- and right derivatives exist at every point $v$. Moreover, $D^-(v) \le D^+(v)$, because each $\Delta(u, v)$ is $\le$ each $\Delta(v, w)$. If these limits are equal, then $g$ is differentiable at $v$.

(e) A convex function is automatically continuous, and thus Borel measurable: For let $v \in U$. If there is a discontinuity at $v$, then it is easy to see that either $\lim_{u \uparrow v} \Delta(u, v)$ or $\lim_{w \downarrow v} \Delta(v, w)$ does not exist.

□ 



Proposition 4.7.5 (Jensen's inequality) 

Suppose that $g : U \to \mathbb{R}$ is a convex function on an open interval $U \subseteq \mathbb{R}$, and that $X$ is a random variable with values in $U$ (almost surely) such that both $X$ and $g(X)$ have finite expected values. Then

$$\mathbb{E}g(X) \ge g(\mathbb{E}X)$$

Proof: We use notation and results from Remarks 4.7.4. Let $v \in U$, and let $D^-(v) = \lim_{u \uparrow v} \Delta(u, v)$ and $D^+(v) = \lim_{w \downarrow v} \Delta(v, w)$. Then $D^-(v), D^+(v)$ both exist, and $D^-(v) \le D^+(v)$. Now suppose that $m$ is a real number satisfying $D^-(v) \le m \le D^+(v)$, and that $x \in U$. We consider two cases: If (i) $x < v$, then $\Delta(x, v) \le D^-(v)$ (since $\Delta(u, v)$ increases as $u \uparrow v$) and thus $\Delta(x, v) \le m$. It follows that $g(x) \ge m(x - v) + g(v)$. Next, if (ii) $x > v$, then $\Delta(v, x) \ge D^+(v)$ (because $\Delta(v, w)$ decreases as $w \downarrow v$) and thus $\Delta(v, x) \ge m$. It follows that $g(x) \ge m(x - v) + g(v)$. Hence, in either case, we have

$$g(x) \ge m(x - v) + g(v) \quad \text{for all } v \in U,\ x \in U, \text{ and } D^-(v) \le m \le D^+(v)$$

Geometrically, this means that the graph of $g$ lies above the graph of the line $y = m(x - v) + g(v)$. Note that both the graph of $g$ and that of the line go through the point $(v, g(v))$.

We are now ready to prove Jensen's inequality: Put $v = \mathbb{E}X$. Then $v \in U$ because $X$ takes values almost surely in $U$. We thus have

$$g(X) \ge m(X - \mathbb{E}X) + g(\mathbb{E}X) \quad \text{whenever } D^-(\mathbb{E}X) \le m \le D^+(\mathbb{E}X)$$

If we now take expectations on both sides (i.e. if we integrate with respect to $\mathbb{P}$ on both sides), then

$$\mathbb{E}g(X) \ge m(\mathbb{E}X - \mathbb{E}X) + \mathbb{E}g(\mathbb{E}X) = g(\mathbb{E}X)$$
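The supporting-line step of the proof can be traced on a small example. The sketch below assumes the convex function $g(x) = x^2$ and a three-point law, both chosen for illustration; since this $g$ is differentiable, the slope $m = g'(v) = 2v$ lies between $D^-(v)$ and $D^+(v)$.

```python
from fractions import Fraction

law = {Fraction(-1): Fraction(1, 4), Fraction(0): Fraction(1, 4), Fraction(3): Fraction(1, 2)}
g = lambda x: x * x                     # a convex function on U = R

mean = sum(x * p for x, p in law.items())            # v = EX
m = 2 * mean                                         # g is differentiable, so m = g'(v)
support_line = lambda x: m * (x - mean) + g(mean)    # the line through (v, g(v)) with slope m

lies_above = all(g(x) >= support_line(x) for x in law)
E_g = sum(g(x) * p for x, p in law.items())          # E g(X)
print(mean, E_g, g(mean), lies_above)
```

Because the graph of $g$ lies above the supporting line at every value of $X$, integrating both sides yields $\mathbb{E}g(X) \ge g(\mathbb{E}X)$, as the printed values confirm for this law.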



Chapter 5 

Products and Independence 



5.1 Product Spaces 
5.1.1 Introduction 

Example 5.1.1 (a) Denote by $\mu A$ the area of a subset $A$ of $\mathbb{R}^2$. We know how to define $\mu$ on rectangles, i.e. sets of the form $A = B_1 \times B_2$, where $B_1, B_2$ are intervals in $\mathbb{R}$: Indeed

$$\mu(B_1 \times B_2) = \lambda B_1 \times \lambda B_2 \qquad (*)$$

where $\lambda$ is Lebesgue measure. So $\mu$ is to be a measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ such that $\mu(B_1 \times B_2) = \lambda(B_1)\lambda(B_2)$. Of course, many sets in $\mathcal{B}(\mathbb{R}^2)$ do not have the form $B_1 \times B_2$, and we would like $\mu$ to be defined for them as well. So (*) cannot serve as a definition of $\mu$.

(b) In probability theory, it is quite natural to consider the product of two probability spaces. Such products typically model sequences of independent experiments. For example, let $\Omega_1 = \{H, T\}$, $\mathcal{F}_1 = \mathcal{P}(\Omega_1)$ and let $\mathbb{P}_1\{H\} = \frac{1}{2} = \mathbb{P}_1\{T\}$. Then $(\Omega_1, \mathcal{F}_1, \mathbb{P}_1)$ models the tossing of a fair coin. Now let $\Omega_2 = \{1, 2, \dots, 6\}$, $\mathcal{F}_2 = \mathcal{P}(\Omega_2)$ and $\mathbb{P}_2\{1\} = \mathbb{P}_2\{2\} = \dots = \mathbb{P}_2\{6\} = \frac{1}{6}$. Then $(\Omega_2, \mathcal{F}_2, \mathbb{P}_2)$ models the rolling of a fair die. The underlying set of the probability space which models the combined random experiment "First toss a fair coin, and then roll a fair die" can clearly be taken to be the cartesian product $\Omega = \Omega_1 \times \Omega_2$. The natural $\sigma$-algebra will be $\mathcal{F} = \mathcal{P}(\Omega_1 \times \Omega_2)$, and it is not hard to see that this $\sigma$-algebra is generated by the $\pi$-system $\{B_1 \times B_2 : B_1 \in \mathcal{F}_1, B_2 \in \mathcal{F}_2\}$. Now the event $B_1 \times B_2 \subseteq \Omega$ consists of all those outcomes $\omega = (\omega_1, \omega_2) \in \Omega_1 \times \Omega_2$ having $\omega_1 \in B_1$ and $\omega_2 \in B_2$. Thus $B_1 \times B_2$ occurs in the combined random experiment iff $B_1$ and $B_2$ occur in each of the individual experiments.

The probability measure associated with the combined random experiment would therefore naturally satisfy

$$\mathbb{P}(B_1 \times B_2) = \mathbb{P}_1(B_1)\mathbb{P}_2(B_2) \qquad (**)$$

But not every event in $\mathcal{P}(\Omega_1 \times \Omega_2)$ is of the form $B_1 \times B_2$, so (**) cannot serve as a definition of $\mathbb{P}$.

□ 
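For the coin-and-die experiment the sample space is finite, so a candidate product measure can be written down through its point masses and compared with (**). This is a minimal sketch for the finite case only; the dictionary layout and the helper `prob` are assumptions of the illustration, not the general construction developed below.

```python
from fractions import Fraction
from itertools import product

P1 = {"H": Fraction(1, 2), "T": Fraction(1, 2)}     # fair coin
P2 = {k: Fraction(1, 6) for k in range(1, 7)}       # fair die

# Point masses of the candidate product measure on Omega = Omega_1 x Omega_2.
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}

def prob(event):
    """Probability of an arbitrary subset of Omega, obtained by adding point masses."""
    return sum(P[w] for w in event)

B1, B2 = {"H"}, {2, 4, 6}
rectangle = set(product(B1, B2))
# An event that is not of the form B1 x B2: "heads and 1, or tails and 6".
non_rectangle = {("H", 1), ("T", 6)}
print(prob(rectangle), prob(non_rectangle))
```

The rectangle receives probability $\mathbb{P}_1(B_1)\mathbb{P}_2(B_2) = \frac{1}{4}$, while the second event shows that a probability must also be assigned to sets that (**) does not mention.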

The aim of this section is to construct, out of two measure spaces $(S, \mathcal{S}, \mu)$, $(T, \mathcal{T}, \nu)$, a new measure space $(S \times T, \mathcal{S} \otimes \mathcal{T}, \mu \otimes \nu)$ satisfying the following requirements:

(i) A subset of $S \times T$ is called a measurable rectangle if it has the form $A \times B$, where $A \in \mathcal{S}$, $B \in \mathcal{T}$.

$\mathcal{S} \otimes \mathcal{T}$ is defined to be the smallest $\sigma$-algebra on $S \times T$ which has all rectangles with measurable sides as members.

(ii) For each rectangle $A \times B$, we require that $(\mu \otimes \nu)(A \times B) = \mu A \cdot \nu B$.

Remarks 5.1.2 (a) A remark on notation: We will be working with functions of more than one variable, and may integrate with respect to just one of those variables. We therefore introduce the following notation:

$$\int f(x)\, \mu(dx) := \mu f =: \mu^x f(x)$$

Thus, for example, $\int f(x, y)\, \mu(dx)$ integrates the function $f(x, y)$ over $x$, keeping $y$ fixed. The integral $\iint f(x, y)\, \mu(dx)\, \nu(dy)$ is a double integral that first integrates $f$ w.r.t. $\mu$ over the variable $x$, and then integrates the function $y \mapsto \int f(x, y)\, \mu(dx)$ w.r.t. $\nu$ over the variable $y$. We may also write this as $\nu^y(\mu^x f(x, y))$.

(b) Several times below, we will prove a result for finite measures, and then refer to a "standard argument" to lift the result to $\sigma$-finite measures. This is done as follows: Suppose that $\mu$ is $\sigma$-finite on $(S, \mathcal{S})$, and that a result $\Phi$ has been proved to hold for finite measures. Since $\mu$ is $\sigma$-finite, there exists a sequence of measurable sets $A_n \uparrow S$ such that $\mu A_n < \infty$ for all $n \in \mathbb{N}$. The measures $\mu_n := I_{A_n} \cdot \mu$ are finite on $(S, \mathcal{S})$, so that result $\Phi$ holds for the $\mu_n$. By the MCT, if $f \in (m\mathcal{S})^+$, then

$$\mu f = \mu(\lim_n f I_{A_n}) = \lim_n \mu_n f$$

This is often enough to show that $\Phi$ holds for $\mu$ as well.

□ 

5.1.2 Products of Measure Spaces 

Given two measurable spaces $(S, \mathcal{S})$, $(T, \mathcal{T})$, we can construct a $\sigma$-algebra $\mathcal{S} \otimes \mathcal{T}$ on the cartesian product $S \times T$, as follows: Define projections $\pi_S : S \times T \to S$, $\pi_T : S \times T \to T$ by

$$\pi_S : (s, t) \mapsto s \qquad \pi_T : (s, t) \mapsto t$$

The interpretation is as follows: $(s, t)$ denotes a sample point in a space of "combined" outcomes: i.e. $s \in S$ occurred and $t \in T$ occurred. Given such a combined outcome $\omega = (s, t)$, $\pi_S(\omega) = s$ measures which outcome occurred in $S$, and $\pi_T(\omega) = t$ measures which outcome occurred in $T$. Given that we know a combined outcome $\omega = (s, t)$, we should also know the component outcomes $s$ and $t$. Thus the projection mappings $\pi_S, \pi_T$ should be measurable. The product $\sigma$-algebra $\mathcal{S} \otimes \mathcal{T}$ is defined to be the smallest $\sigma$-algebra on $S \times T$ which makes these maps measurable. To recapitulate:

Definition 5.1.3 Let $(S, \mathcal{S})$ and $(T, \mathcal{T})$ be measurable spaces. Define projections $\pi_S : S \times T \to S$, $\pi_T : S \times T \to T$ by

$$\pi_S : (s, t) \mapsto s \qquad \pi_T : (s, t) \mapsto t$$

Then define $\mathcal{S} \otimes \mathcal{T} := \sigma(\pi_S, \pi_T)$ to be the smallest $\sigma$-algebra for which both projections are measurable.






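Exercise 5.1.4 Let $(S, \mathcal{S})$ and $(T, \mathcal{T})$ be measurable spaces, and let $\mathcal{R} := \{A \times B : A \in \mathcal{S}, B \in \mathcal{T}\}$ be the set of all measurable rectangles. Note that $\mathcal{R}$ is a $\pi$-system. Show that

$$\mathcal{S} \otimes \mathcal{T} = \sigma(\mathcal{R}).$$

Hence the product $\sigma$-algebra is generated by the $\pi$-system of all measurable rectangles.
[Hint: $A \times B = (A \times T) \cap (S \times B)$, and $A \times T = \pi_S^{-1}[A]$.]

□

Exercise 5.1.5 Show that $\mathcal{B}(\mathbb{R}^2) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$.

[Hint: Using Exercise 5.1.4 it is easy to see that $\mathcal{B}(\mathbb{R}^2) \supseteq \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$. For the opposite direction, show that any open set in $\mathbb{R}^2$ can be written as a countable union of sets of the form $U \times V$, where $U, V$ are open intervals in $\mathbb{R}$. Cf. footnote [4] in Chapter 1.]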

□ 

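Suppose that $(S, \mathcal{S}, \mu)$ and $(T, \mathcal{T}, \nu)$ are measure spaces. We would like to construct a measure $\mu \otimes \nu$ on $(S \times T, \mathcal{S} \otimes \mathcal{T})$. One way that suggests itself is to define

$$(1) \qquad (\mu \otimes \nu)B := \int \left(\int I_B(s, t)\, \nu(dt)\right) \mu(ds) = \mu^s(\nu^t I_B(s, t)) \qquad B \in \mathcal{S} \otimes \mathcal{T}$$

Another is to define it as

$$(2) \qquad (\mu \otimes \nu)B := \int \left(\int I_B(s, t)\, \mu(ds)\right) \nu(dt) = \nu^t(\mu^s I_B(s, t)) \qquad B \in \mathcal{S} \otimes \mathcal{T}$$

Exercise 5.1.6 Check that

$$(\mu \otimes \nu)(A \times B) = \mu A \cdot \nu B \qquad A \in \mathcal{S},\ B \in \mathcal{T}$$

for both of the above possible definitions of $\mu \otimes \nu$.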

□ 

We shall soon see that (i) the above definitions are both possible, and (ii) they coincide. 

We first investigate the possibility of defining $\mu \otimes \nu$ in the above manner. To be able to perform a double integral $\iint f(s, t)\, \nu(dt)\, \mu(ds)$ it is necessary that:

(i) for each $s \in S$, the map $t \mapsto f(s, t)$ must be $\mathcal{T}$-measurable, so that we can calculate the inner integral $\int f(s, t)\, \nu(dt)$;

(ii) the map $s \mapsto F(s) := \int f(s, t)\, \nu(dt)$ must be $\mathcal{S}$-measurable, so that we can calculate the outer integral $\int F(s)\, \mu(ds)$.

The following lemma gives us what we need:

Lemma 5.1.7 Suppose that $(S, \mathcal{S})$ and $(T, \mathcal{T})$ are measurable spaces, that $\mu$ is a $\sigma$-finite measure on $(S, \mathcal{S})$, and that $f : S \times T \to \overline{\mathbb{R}}^+$ is $\mathcal{S} \otimes \mathcal{T}$-measurable. Then

(i) For each $t \in T$, the map $s \mapsto f(s, t)$ is $\mathcal{S}$-measurable.

(ii) The map $t \mapsto \int f(s, t)\, \mu(ds)$ is $\mathcal{T}$-measurable.






Proof: We apply the Monotone Class Theorem (Thm. 2.3.5). First assume that $\mu$ is a finite measure, and let

$$\mathcal{H} = \{f \in m(\mathcal{S} \otimes \mathcal{T}) : f \text{ is bounded and satisfies (i) and (ii)}\}$$

It is easy to verify that $\mathcal{H}$ is a vector space (we need the finiteness of $\mu$ in order to avoid expressions of the form $\infty - \infty$), and that each $I_{A \times B} \in \mathcal{H}$, where $A \in \mathcal{S}$, $B \in \mathcal{T}$. By the MCT, $\mathcal{H}$ is closed under bounded limits of increasing non-negative sequences. Moreover, the set $\mathcal{R} := \{A \times B : A \in \mathcal{S}, B \in \mathcal{T}\}$ is a $\pi$-system with the property that $I_R \in \mathcal{H}$ for every $R \in \mathcal{R}$, and thus by Thm. 2.3.5 every bounded $\mathcal{S} \otimes \mathcal{T}$-measurable function belongs to $\mathcal{H}$ (since $\sigma(\mathcal{R}) = \mathcal{S} \otimes \mathcal{T}$). Now each non-negative measurable function $f$ is the limit of bounded non-negative measurable functions ($f = \lim_n f \wedge n$), and thus another application of the MCT shows that every $f \in m(\mathcal{S} \otimes \mathcal{T})^+$ satisfies (i) and (ii).

Now drop the assumption that $\mu$ is a finite measure. Because $\mu$ is $\sigma$-finite, we can choose $A_n \uparrow S$ such that $\mu A_n < \infty$. The measures $\mu_n = I_{A_n} \cdot \mu$ are finite measures, and thus each map $t \mapsto \int f(s, t)\, \mu_n(ds)$ is $\mathcal{T}$-measurable (where $f \ge 0$). Since $\int f(s, t)\, \mu(ds) = \lim_n \int f(s, t)\, \mu_n(ds)$, the MCT implies that the result holds for $\mu$.



We now know that it is possible to define $\mu \otimes \nu$ in the ways indicated. What we don't (yet) know is that these constructions define a measure, and that they coincide.
For definiteness, we arbitrarily fix one of the above definitions:

Definition 5.1.8 Suppose that $(S, \mathcal{S}, \mu)$ and $(T, \mathcal{T}, \nu)$ are $\sigma$-finite measure spaces. Define a map $\mu \otimes \nu : \mathcal{S} \otimes \mathcal{T} \to \overline{\mathbb{R}}^+$ by

$$(\mu \otimes \nu)B := \iint I_B(s, t)\, \nu(dt)\, \mu(ds) = \mu^s(\nu^t I_B(s, t)) \qquad B \in \mathcal{S} \otimes \mathcal{T}$$

$\mu \otimes \nu$ is called the product measure of $\mu, \nu$.

Exercise 5.1.9 Show that $\mu \otimes \nu$ defines a $\sigma$-finite measure on $(S \times T, \mathcal{S} \otimes \mathcal{T})$.



□ 

The next two results show that (modulo certain conditions) we can calculate the integral w.r.t. $\mu \otimes \nu$ as a double integral, and the order of integration doesn't matter:

$$\int f\, d(\mu \otimes \nu) = \iint f(s, t)\, \nu(dt)\, \mu(ds) = \iint f(s, t)\, \mu(ds)\, \nu(dt)$$

We first show this for non-negative measurable functions:

Theorem 5.1.10 (Tonelli)
Suppose that $(S, \mathcal{S}, \mu)$ and $(T, \mathcal{T}, \nu)$ are $\sigma$-finite measure spaces. If $f \in m(\mathcal{S} \otimes \mathcal{T})^+$, then

$$(\mu \otimes \nu)f = \mu^s(\nu^t f(s, t)) = \nu^t(\mu^s f(s, t)) \qquad (*)$$






Proof: We use the Monotone Class Theorem (Thm. 2.3.5). First assume that $\mu, \nu$ are finite measures. The result is obvious if $f = I_{A \times B}$, where $A \times B$ is a measurable rectangle (or cf. Exercise 5.1.6). The class

$$\mathcal{H} = \{f \in m(\mathcal{S} \otimes \mathcal{T}) : f \text{ is bounded and satisfies (*)}\}$$

is easily seen to satisfy the requirements of Thm. 1.5.4, and thus that theorem implies that $\mathcal{H}$ contains every bounded $\mathcal{S} \otimes \mathcal{T}$-measurable function. The result for arbitrary non-negative $f$ follows by the MCT.

A standard argument lifts the result to the case where $\mu, \nu$ are merely $\sigma$-finite.

As a by-product, we obtain the result that our two possible definitions of $\mu \otimes \nu$ as iterated integrals coincide: If $B \in \mathcal{S} \otimes \mathcal{T}$, then $I_B$ is a non-negative measurable function, and we may apply Tonelli's Thm.

For non-negative functions $f$, the integral $\mu f$ always makes sense, but we may have $\mu f = \infty$. For arbitrary measurable $f$, we have to be more careful:

Theorem 5.1.11 (Fubini)
Suppose that $(S, \mathcal{S}, \mu)$ and $(T, \mathcal{T}, \nu)$ are $\sigma$-finite measure spaces. If $f \in \mathcal{L}^1(S \times T, \mathcal{S} \otimes \mathcal{T}, \mu \otimes \nu)$, then

$$(\mu \otimes \nu)f = \mu^s(\nu^t f(s, t)) = \nu^t(\mu^s f(s, t))$$

Here the map $t \mapsto \mu^s f(s, t)$ is defined for $\nu$-a.e. $t \in T$ and belongs to $\mathcal{L}^1(T, \mathcal{T}, \nu)$. Similarly, the map $s \mapsto \nu^t f(s, t)$ is defined for $\mu$-a.e. $s \in S$ and belongs to $\mathcal{L}^1(S, \mathcal{S}, \mu)$.

Proof: The result holds for $|f|$, by Tonelli's Thm., and hence $N_S = \{s \in S : \nu^t|f(s, t)| = +\infty\}$ is $\mu$-null, and $N_T = \{t \in T : \mu^s|f(s, t)| = +\infty\}$ is $\nu$-null. Redefine $f(s, t)$ to be zero when either $s \in N_S$ or $t \in N_T$; this won't affect the integral of $f$, by Thm. 4.3.9. The result follows by splitting $f$ into positive and negative parts.

⊣

Remarks 5.1.12 (a) Fubini's Theorem allows the interchange of the order of integration, provided the integrand is integrable w.r.t. the product measure. It follows from Fubini's Theorem that

$$\int \left(\int f\, d\nu\right) d\mu = \int \left(\int f\, d\mu\right) d\nu$$

provided that $f \in \mathcal{L}^1$. See Exercise 5.1.13 for what can happen if $f \notin \mathcal{L}^1$.

(b) Fubini's Theorem also easily extends to arbitrary finite products: If $(S_i, \mathcal{S}_i, \mu_i)$ are $\sigma$-finite measure spaces for $i = 1, \dots, n$, then

(i) $\mathcal{S}_1 \otimes \dots \otimes \mathcal{S}_n$ is the $\sigma$-algebra on $S_1 \times \dots \times S_n$ which is generated by the projections $\pi_i : S_1 \times \dots \times S_n \to S_i : (s_1, \dots, s_n) \mapsto s_i$. It is also generated by the family of measurable "rectangles" $\mathcal{R} = \{A_1 \times \dots \times A_n : A_i \in \mathcal{S}_i \text{ for } i = 1, \dots, n\}$.

(ii) $\mu_1 \otimes \dots \otimes \mu_n$ is the unique measure on $\mathcal{S}_1 \otimes \dots \otimes \mathcal{S}_n$ which assigns to every rectangle the measure

$$(\mu_1 \otimes \dots \otimes \mu_n)(A_1 \times \dots \times A_n) = \mu_1 A_1 \cdots \mu_n A_n$$






(iii) Fubini's Theorem states that if $f : S_1 \times \dots \times S_n \to \mathbb{R}$ is $\mu_1 \otimes \dots \otimes \mu_n$-integrable, then

$$\int_{S_1 \times \dots \times S_n} f\, d(\mu_1 \otimes \dots \otimes \mu_n) = \int_{S_1} \left(\int_{S_2} \left(\dots \int_{S_n} f\, d\mu_n \right) \dots\, d\mu_2\right) d\mu_1$$

and that any interchange of the order of integration is permissible.



□ 



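Exercise 5.1.13 Let

$$f(x, y) = \frac{x^2 - y^2}{(x^2 + y^2)^2}$$

Show that

$$\int_0^1 \int_0^1 f(x, y)\, \lambda(dy)\, \lambda(dx) = \frac{\pi}{4} = -\int_0^1 \int_0^1 f(x, y)\, \lambda(dx)\, \lambda(dy)$$

What can you conclude about

$$\int_{[0,1] \times [0,1]} f\, d(\lambda \otimes \lambda)\ ?$$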



□ 



5.2 Independence 

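The covariance $\mathrm{Cov}(X, Y)$ and correlation $\rho_{X,Y}$ between two random variables are defined by

$$\mathrm{Cov}(X, Y) := \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)] \qquad \rho_{X,Y} := \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$$

when this integral exists. It is an important and well-known fact that independent random variables are uncorrelated (i.e. have $\rho_{X,Y} = 0$):

Theorem 5.2.1 Suppose that $X, Y \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$ are independent random variables. Then $XY \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$ also, and

$$\mathbb{E}XY = \mathbb{E}X \cdot \mathbb{E}Y \qquad \text{i.e.} \qquad \mathrm{Cov}(X, Y) = 0$$

The same result holds in the extended sense if $X, Y \ge 0$.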



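Proof: Recall that $X, Y$ are independent if and only if $\sigma(X), \sigma(Y)$ are independent $\sigma$-algebras. Now if $X, Y$ are both indicator functions,

$$X = I_A, \qquad Y = I_B$$

then

$$\mathbb{E}(XY) = \int I_{A \cap B}\, d\mathbb{P} = \mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B) = \mathbb{E}(X)\mathbb{E}(Y)$$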






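because $A \in \sigma(X)$, $B \in \sigma(Y)$, and thus $A, B$ are independent events. Next if

$$X = \sum_{j=1}^{n} a_j I_{A_j} \qquad Y = \sum_{k=1}^{m} b_k I_{B_k}$$

are non-negative simple random variables, then each $A_j \in \sigma(X)$ and each $B_k \in \sigma(Y)$. Thus the $A_j, B_k$ are independent, and it follows that

$$\mathbb{E}(XY) = \int XY\, d\mathbb{P} = \sum_j \sum_k a_j b_k \mathbb{P}(A_j \cap B_k) = \sum_j a_j \mathbb{P}(A_j) \sum_k b_k \mathbb{P}(B_k) = \mathbb{E}(X)\mathbb{E}(Y)$$

again because $\mathbb{P}(A_j \cap B_k) = \mathbb{P}(A_j)\mathbb{P}(B_k)$ for each $j, k$. If $X, Y$ are non-negative random variables, we may choose simple non-negative random variables $X_n \uparrow X$, $Y_n \uparrow Y$, with each $X_n$ measurable w.r.t. $\sigma(X)$, and each $Y_n$ measurable w.r.t. $\sigma(Y)$ (e.g. take

$$X_n := \sum_{k=0}^{n2^n - 1} \frac{k}{2^n} I_{\{\frac{k}{2^n} \le X < \frac{k+1}{2^n}\}} + n I_{\{X \ge n\}}$$

to be the usual staircase functions). In that case $X_n, Y_n$ are independent (being measurable with respect to independent $\sigma$-algebras) and $X_n Y_n \uparrow XY$. By the Monotone Convergence Theorem

$$\mathbb{E}(XY) = \lim_{n \to \infty} \mathbb{E}(X_n Y_n) = \lim_{n \to \infty} \mathbb{E}(X_n)\mathbb{E}(Y_n) = \lim_{n \to \infty} \mathbb{E}(X_n) \lim_{n \to \infty} \mathbb{E}(Y_n) = \mathbb{E}(X)\mathbb{E}(Y)$$

Finally if $X, Y \in \mathcal{L}^1$, we may split $X, Y$ up into their positive and negative parts and apply linearity.

The other assertions now follow easily. For example, $\mathrm{cov}(X, Y) = \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y) = 0$ if $X, Y$ are independent. Also, writing $\langle U, V \rangle := \mathrm{cov}(U, V)$, we have $\mathrm{var}(X + Y) = \mathrm{cov}(X + Y, X + Y) = \langle X + Y, X + Y \rangle = \langle X, X \rangle + \langle Y, Y \rangle + 2\langle X, Y \rangle = \mathrm{var}(X) + \mathrm{var}(Y) + 2\mathrm{cov}(X, Y) = \mathrm{var}(X) + \mathrm{var}(Y)$.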



Remarks 5.2.2 Note that it is not true in general that if $X, Y \in \mathcal{L}^1$, then $XY \in \mathcal{L}^1$. However, the above theorem shows that this is the case if $X, Y$ are independent.

□ 



Actually, there is an easier proof of Propn. 3.4.5, if we adopt another point of view: If $X, Y$ are random variables, then $(X,Y)$ is a random vector, i.e. a map $(X,Y) : (\Omega,\mathcal{F},\mathbb{P}) \to (\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$. Its distribution is a probability measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ given by $\mu_{X,Y}(B) = \mathbb{P}\{(X,Y) \in B\}$, where $B \in \mathcal{B}(\mathbb{R}^2)$. If $\mu_X, \mu_Y$ are the distributions of $X, Y$ respectively, then the product measure $\mu_X \otimes \mu_Y$ is another probability measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$. It turns out that $X, Y$ are independent iff $\mu_{X,Y} = \mu_X \otimes \mu_Y$.






Theorem 5.2.3 Let $X, Y$ be random variables on $(\Omega,\mathcal{F},\mathbb{P})$. Let $\mu_{X,Y}, \mu_X, \mu_Y$ be the distributions of the random elements $(X,Y)$, $X$ and $Y$. Then $X, Y$ are independent iff

$$\mu_{X,Y} = \mu_X \otimes \mu_Y$$



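Proof: Suppose that $X, Y$ are independent. If $A \times B$ is a measurable rectangle in $\mathcal{B}(\mathbb{R}^2) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$, then

$$\mu_{X,Y}(A \times B) = \mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A)\,\mathbb{P}(Y \in B) = \mu_X A \cdot \mu_Y B = (\mu_X \otimes \mu_Y)(A \times B)$$

Hence $\mu_{X,Y}$ and $\mu_X \otimes \mu_Y$ agree on the family of measurable rectangles, which is a $\pi$-system that generates $\mathcal{B}(\mathbb{R}^2)$. Two probability measures that agree on a generating $\pi$-system are equal, so $\mu_{X,Y} = \mu_X \otimes \mu_Y$.

Conversely, if $\mu_{X,Y} = \mu_X \otimes \mu_Y$, then if $x, y \in \mathbb{R}$, we have

$$\mathbb{P}(X \le x, Y \le y) = \mu_{X,Y}\big((-\infty,x] \times (-\infty,y]\big) = \mu_X(-\infty,x] \cdot \mu_Y(-\infty,y] = \mathbb{P}(X \le x)\,\mathbb{P}(Y \le y)$$

Hence $X, Y$ are independent (cf. Exercise 3.4.4).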
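
On a finite sample space the criterion of Theorem 5.2.3 can be tested atom by atom. The sketch below assumes the two-dice model with uniform measure; the helper names `law` and `is_product` are hypothetical and only serve the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice
p = Fraction(1, 36)

def law(f):
    """Distribution of f (push-forward of P) as a dict value -> probability."""
    d = Counter()
    for w in omega:
        d[f(w)] += p
    return dict(d)

def is_product(f, g):
    """Check mu_{(f,g)} = mu_f (x) mu_g on every atom (a, b)."""
    joint = law(lambda w: (f(w), g(w)))
    mf, mg = law(f), law(g)
    return all(joint.get((a, b), 0) == mf[a] * mg[b] for a in mf for b in mg)

parity_first = lambda w: w[0] % 2   # depends on the first die only
second = lambda w: w[1]             # depends on the second die only
total = lambda w: w[0] + w[1]

indep = is_product(parity_first, second)   # independent pair
dep = is_product(second, total)            # dependent pair
print(indep, dep)
```

Since all sets here are finite, agreement on atoms is the same as agreement on measurable rectangles.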



Exercise 5.2.4 Use the Change of Variables Thm. to show that

$$\int_\Omega XY\,d\mathbb{P} = \int_{\mathbb{R}^2} xy\,d\mu_{X,Y}$$

Now prove Propn. 5.2.1 once more, using Fubini's Theorem.



□ 



Chapter 6 



Spaces of Random Variables 



6.1 Topological Vector Spaces 

6.1.1 Normed Vector Spaces 

Definition 6.1.1 A normed space is a pair $(V, \|\cdot\|)$, where $V$ is a vector space and $\|\cdot\|$ is a norm on $V$, i.e. a function $\|\cdot\| : V \to \mathbb{R}$ with the following properties:

(i) $\|x\| \ge 0$ for all $x \in V$;

(ii) $\|x\| = 0$ if and only if $x = 0$;

(iii) $\|ax\| = |a|\,\|x\|$ for all $x \in V$ and $a \in \mathbb{R}$;

(iv) $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in V$ (Triangle Inequality).




The norm $\|v\|$ of a vector $v \in V$ should be interpreted as its length, as in the following standard examples.

Examples 6.1.2 (a) $V := \mathbb{R}$ with $\|v\| := |v|$ (the absolute value).

(b) $V := \mathbb{R}^n$ with the Euclidean norm

$$\|v\| := \sqrt{v_1^2 + \dots + v_n^2} \quad\text{where } v = (v_1, \dots, v_n)$$

(c) $\mathbb{C}$ with $\|z\| = |z|$ (the modulus).

Here are some other norms on $\mathbb{R}^n$:

Exercise 6.1.3 For $x \in \mathbb{R}^n$, let $x = (x_1, \dots, x_n)$.

(a) Define $\|\cdot\|_1$ on $\mathbb{R}^n$ by

$$\|x\|_1 = |x_1| + \dots + |x_n|$$

Show that $\|\cdot\|_1$ is a norm on $\mathbb{R}^n$.

(b) Define $\|\cdot\|_\infty$ on $\mathbb{R}^n$ by

$$\|x\|_\infty = \max\{|x_1|, \dots, |x_n|\}$$

Show that $\|\cdot\|_\infty$ is a norm on $\mathbb{R}^n$.

□



We will give more interesting examples shortly. 
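
The three norms on $\mathbb{R}^n$ met so far are easy to compute. The following is a small sketch (the sample vectors are arbitrary illustrative choices) that evaluates them and spot-checks the triangle inequality (iv); it is a numerical check, not a proof.

```python
import math

def norm1(x):
    """1-norm: sum of absolute values of the coordinates."""
    return sum(abs(t) for t in x)

def norm2(x):
    """Euclidean norm."""
    return math.sqrt(sum(t * t for t in x))

def norm_inf(x):
    """Maximum norm."""
    return max(abs(t) for t in x)

x = (3.0, -4.0, 0.0)
y = (1.0, 2.0, -2.0)
s = tuple(a + b for a, b in zip(x, y))   # x + y

for name, norm in [("1", norm1), ("2", norm2), ("inf", norm_inf)]:
    # Triangle inequality; the tolerance guards against rounding error.
    assert norm(s) <= norm(x) + norm(y) + 1e-12
    print(name, norm(x), norm(y), norm(s))
```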

Exercise 6.1.4 Let $(V, \|\cdot\|)$ be a normed space.

(a) Prove that $\|x\| = \|-x\|$ for all $x \in V$.

(b) Prove that

$$\|x - y\| \ge \big|\, \|x\| - \|y\| \,\big| \quad\text{for all } x, y \in V$$

□



Here is a first look at function spaces: 



Exercise 6.1.5 Suppose that $[a,b]$ is a closed interval in $\mathbb{R}$. Let $C[a,b]$ be the set of all continuous functions $f : [a,b] \to \mathbb{R}$.

(a) Show that $C[a,b]$ is a vector space over the scalar field $\mathbb{R}$, where the operations of addition and scalar multiplication are defined pointwise.

(b) Define $\|\cdot\|_1 : C[a,b] \to \mathbb{R}$ by

$$\|f\|_1 := \int_a^b |f(t)|\,dt$$

Show that $\|\cdot\|_1$ is a norm on $C[a,b]$.

(c) Define $\|\cdot\|_\infty : C[a,b] \to \mathbb{R}$ by

$$\|f\|_\infty := \sup\{|f(t)| : t \in [a,b]\}$$

Show that $\|\cdot\|_\infty$ is a norm on $C[a,b]$.

□ 

If $(V, \|\cdot\|)$ is a normed space, then we can define the distance $d(v,w)$ between two elements $v, w \in V$ by

$$d(v,w) := \|v - w\|$$

It is easy to see that the following holds:

Proposition 6.1.6 For any $u, v, w$ in a normed space $(V, \|\cdot\|)$:

(i) $d(v,w) \ge 0$.

(ii) $d(v,w) = 0$ if and only if $v = w$.

(iii) $d(v,w) = d(w,v)$.

(iv) $d(v,w) \le d(v,u) + d(u,w)$ ($\triangle$-inequality).






If $X$ is any set, then a distance function $d : X \times X \to \mathbb{R}$ which satisfies conditions (i)-(iv) of the preceding proposition is called a metric on $X$. The pair $(X,d)$ is called a metric space. Thus we see that:

Proposition 6.1.7 Every norm induces a metric or distance: If $(V, \|\cdot\|)$ is a normed space, then $(V,d)$ is a metric space, where $d : V \times V \to \mathbb{R}$ is defined by

$$d(v,w) := \|v - w\|$$



Once we have a metric, i.e. a concept of distance, we can use it to talk about limits and 
convergence: 



Definition 6.1.8 Suppose that $(X,d)$ is a metric space, that $(x_n)$ is a sequence in $X$, and that $x \in X$. We say that

$$x_n \to x \quad\text{or that}\quad \lim_n x_n = x$$

if and only if $d(x_n, x) \to 0$. Note that $(d(x_n,x))$ is a sequence of real numbers, so that we already know what $d(x_n,x) \to 0$ means:

$$x_n \to x \iff d(x_n,x) \to 0 \iff \forall \varepsilon > 0\ \exists N\ \forall n \ge N\ \big(d(x_n,x) < \varepsilon\big)$$



We briefly introduce the following notions: 



Definition 6.1.9 (i) A sequence $(x_n)_n$ in a metric space $(X,d)$ is called a Cauchy sequence iff $\forall \varepsilon > 0\ \exists N \in \mathbb{N}\ \forall n, m \ge N\ [d(x_n, x_m) < \varepsilon]$, i.e. from some point $N$ onwards any two terms of the sequence (not necessarily successive) lie within a distance of $\varepsilon$ of each other.

(ii) A metric space $(X,d)$ is said to be complete if every Cauchy sequence in $X$ converges.

(iii) A normed vector space which is complete (w.r.t. the metric induced by the norm) is called a Banach space.

Thus $(V, \|\cdot\|)$ is a Banach space iff whenever $(v_n)_n$ is a sequence in $V$ such that $\lim_{n,m\to\infty} \|v_n - v_m\| = 0$, there is $v \in V$ such that $v_n \to v$.



6.1.2 Inner Product Spaces 

Next, we consider an additional structure on a real vector space V: 






Definition 6.1.10 An inner product space is a pair $(V, \langle\cdot,\cdot\rangle)$, where $V$ is a vector space over $\mathbb{R}$ and $\langle\cdot,\cdot\rangle$ is an inner product on $V$, i.e. a function $\langle\cdot,\cdot\rangle : V \times V \to \mathbb{R}$ with the following properties:

(i) $\langle x, y\rangle = \langle y, x\rangle$ for all $x, y \in V$;

(ii) $\langle x, x\rangle \ge 0$ for all $x \in V$;

(iii) $\langle x, x\rangle = 0$ if and only if $x = 0$;

(iv) $\langle x, y + z\rangle = \langle x, y\rangle + \langle x, z\rangle$ for all $x, y, z \in V$;

(v) $\langle ax, y\rangle = a\langle x, y\rangle$ for all $x, y \in V$ and $a \in \mathbb{R}$.

$\mathbb{R}^n$ is an inner product space when equipped with the usual dot product:

$$\langle x, y\rangle := x \cdot y = \sum_{j=1}^{n} x_j y_j$$

where $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$.

We shall shortly present more interesting examples of inner product spaces. For the moment, we note that every inner product induces a norm, in the same way that the dot product in $\mathbb{R}^n$ yields a length: The length of a vector $x \in \mathbb{R}^n$ is given by $\sqrt{x \cdot x}$, and it turns out that $\|x\| := \sqrt{\langle x, x\rangle}$ defines a norm in terms of the inner product.

To prove that, we need the following result: 

Proposition 6.1.11 (Cauchy-Schwarz Inequality)

If $(V, \langle\cdot,\cdot\rangle)$ is an inner product space, then

$$|\langle x, y\rangle| \le \sqrt{\langle x, x\rangle}\,\sqrt{\langle y, y\rangle}$$

for all $x, y \in V$.

Moreover, if $x \ne 0$, we have equality iff $y$ is a scalar multiple of $x$.
(Also called the Cauchy-Bunyakovskii-Schwarz Inequality.)



Proof: Let $(V, \langle\cdot,\cdot\rangle)$ be an inner product space. Note that for $a \in \mathbb{R}$ and $x, y \in V$ we have

$$\begin{aligned}
0 \le \langle ax - y, ax - y\rangle &= a^2\langle x, x\rangle - a\langle x, y\rangle - a\langle y, x\rangle + \langle y, y\rangle \\
&= a^2\langle x, x\rangle - 2a\langle x, y\rangle + \langle y, y\rangle
\end{aligned}$$

Now, with $x, y$ held fixed, consider the righthand side of the above inequality as a quadratic polynomial in $a$. Since it is always non-negative, it can have at most one root, and thus its discriminant is $\le 0$, i.e.

$$\langle x, y\rangle^2 - \langle x, x\rangle\langle y, y\rangle \le 0$$

which yields the result upon taking square roots.

We have $\langle x, y\rangle^2 - \langle x, x\rangle\langle y, y\rangle = 0$ only when this quadratic has a root, i.e. if there is $a$ such that $\langle ax - y, ax - y\rangle = 0$. But then $ax = y$, i.e. $y$ is a scalar multiple of $x$.

⊣






Proposition 6.1.12 If $(V, \langle\cdot,\cdot\rangle)$ is an inner product space, then the map $\|\cdot\| : V \to \mathbb{R}$ given by

$$\|x\| := \sqrt{\langle x, x\rangle}$$

defines a norm on $V$.

Proof: We only verify the triangle inequality. Note that the Cauchy-Schwarz inequality states that $|\langle x, y\rangle| \le \|x\|\,\|y\|$. Thus

$$\begin{aligned}
\|x + y\|^2 = \langle x + y, x + y\rangle &= \langle x, x\rangle + 2\langle x, y\rangle + \langle y, y\rangle \\
&= \|x\|^2 + 2\langle x, y\rangle + \|y\|^2 \\
&\le \|x\|^2 + 2\|x\|\,\|y\| + \|y\|^2 \\
&= (\|x\| + \|y\|)^2
\end{aligned}$$

⊣

Thus an inner product induces a norm, and a norm induces a metric. An inner product 
space which is complete (w.r.t. the metric induced by the inner product) is called a Hilbert 
space. In particular, every Hilbert space is also a Banach space. 

The following proposition is an easy exercise: 



Proposition 6.1.13 Let $V$ be an inner product space with induced norm $\|\cdot\|$.

(a) (Pythagoras' Law) If $v, w \in V$, with $v \perp w$, then $\|v + w\|^2 = \|v\|^2 + \|w\|^2$.

(b) (Parallelogram Law) If $v, w \in V$, then $\|v + w\|^2 + \|v - w\|^2 = 2\|v\|^2 + 2\|w\|^2$.



In $\mathbb{R}^n$, the dot product does not only induce a length; it also induces an angle: The angle $\theta$ between two non-zero vectors $x, y \in \mathbb{R}^n$ is given by

$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|}$$

We can imitate this definition in an abstract inner product space $(V, \langle\cdot,\cdot\rangle)$, and define the angle between non-zero $x, y \in V$ by

$$\cos\theta := \frac{\langle x, y\rangle}{\|x\|\,\|y\|} \quad\text{where } \|x\| := \sqrt{\langle x, x\rangle}$$

By the Cauchy-Schwarz inequality it follows immediately that $|\cos\theta| \le 1$, so that this definition makes sense. It also follows that $|\cos\theta| = 1$ if and only if $x$ is a scalar multiple of $y$, i.e. iff $x, y$ are parallel. We can also define orthogonality in an abstract inner product space, in the obvious way:

Definition 6.1.14 Suppose that $(V, \langle\cdot,\cdot\rangle)$ is an inner product space. We say that $x, y \in V$ are orthogonal, and write $x \perp y$, if and only if $\langle x, y\rangle = 0$.
If $G \subseteq V$, we say that $x \perp G$ iff $\forall g \in G\ (x \perp g)$.
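
The Cauchy-Schwarz inequality, Pythagoras' Law and the Parallelogram Law can be spot-checked with exact rational arithmetic. This is only an illustrative sketch in $\mathbb{R}^3$ with the dot product; the vectors `x`, `y`, `z` are arbitrary choices (with `y` picked orthogonal to `x`).

```python
from fractions import Fraction as F

def inner(x, y):
    """Dot product on R^3 (exact, using rationals)."""
    return sum(a * b for a, b in zip(x, y))

def add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def sub(x, y):
    return tuple(a - b for a, b in zip(x, y))

def norm_sq(x):
    """Squared induced norm <x, x>."""
    return inner(x, x)

x = (F(1), F(2), F(2))
y = (F(2), F(-1), F(0))    # chosen so that <x, y> = 0
z = (F(3), F(0), F(4))

# Cauchy-Schwarz: <x,z>^2 <= <x,x><z,z>, i.e. cos^2(theta) <= 1.
cos_sq = inner(x, z) ** 2 / (norm_sq(x) * norm_sq(z))

# Pythagoras for the orthogonal pair x, y.
pythagoras = norm_sq(add(x, y)) == norm_sq(x) + norm_sq(y)

# Parallelogram law for the non-orthogonal pair x, z.
parallelogram = (norm_sq(add(x, z)) + norm_sq(sub(x, z))
                 == 2 * norm_sq(x) + 2 * norm_sq(z))

print(inner(x, y), cos_sq, pythagoras, parallelogram)
```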






6.1.3 Orthogonal Projection in Hilbert Spaces 

If $W$ is a linear subspace of $\mathbb{R}^n$, then we can project any $x \in \mathbb{R}^n$ onto $W$. That is, we can represent $x$ as a sum

$$x = x^{\parallel} + x^{\perp} \quad\text{where } x^{\parallel} \in W,\ x^{\perp} \perp W$$

One can think of $x^{\parallel}$ as the best approximation to $x$ in $W$: It is the vector in $W$ which lies closest to $x$. We call $x^{\parallel}$ the orthogonal projection of $x$ onto $W$.

We would like to extend this idea of orthogonal projection to arbitrary Hilbert spaces. Suppose that $V$ is a Hilbert space, and that $W$ is a closed linear subspace of $V$. If $v_0 \in V$, we would like to find the best approximation of $v_0$ in $W$. This is the unique vector $w_0$ with the properties that

(i) $w_0 \in W$, and

(ii) $\|v_0 - w_0\| = \inf\{\|v_0 - w\| : w \in W\}$, i.e. $w_0$ is the vector in $W$ that lies closest to $v_0$.

(iii) Moreover, $(v_0 - w_0) \perp W$.

The vector $w_0$ satisfying (i)-(iii) is called the orthogonal projection of $v_0$ onto $W$. Indeed, $v_0 = w_0 + (v_0 - w_0)$ decomposes $v_0$ into a vector in $W$ and a vector orthogonal to $W$.

It remains to show that orthogonal projections exist and are unique. To be able to do that, we need an additional condition: We say that a linear subspace $W$ of a Hilbert space $V$ is closed if $W$ is itself a Hilbert space. This simply means that if $(w_n)$ is a sequence in $W$ and $w_n \to v$ for some $v \in V$, then $v \in W$, i.e. that if a sequence in $W$ converges, then the limit lies in $W$. Any linear subspace of Euclidean space is automatically closed.



Proposition 6.1.15 Let $V$ be a Hilbert space, and let $W$ be a closed linear subspace of $V$. Then any $v_0$ in $V$ has a unique decomposition

$$v_0 = v_0^{\parallel} + v_0^{\perp} \quad\text{where } v_0^{\parallel} \in W,\ v_0^{\perp} \perp W$$

$v_0^{\parallel}$ is called the orthogonal projection of $v_0$ onto $W$.



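Proof: Uniqueness: If

$$v_0 = v_0^{\parallel} + v_0^{\perp} = u_0^{\parallel} + u_0^{\perp}$$

where $v_0^{\parallel}, u_0^{\parallel} \in W$ and $v_0^{\perp}, u_0^{\perp} \perp W$, then

$$v_0^{\parallel} - u_0^{\parallel} = u_0^{\perp} - v_0^{\perp} =: x$$

is a vector with the properties that $x \in W$ and that $x \perp W$. This implies that $x \perp x$, i.e. that $\langle x, x\rangle = 0$. Hence $x = 0$, and so $v_0^{\parallel} = u_0^{\parallel}$, $v_0^{\perp} = u_0^{\perp}$.

Existence: Let $\delta = \inf\{\|v_0 - w\| : w \in W\}$, and choose a sequence $w_n \in W$ such that $\|v_0 - w_n\| \to \delta$. We show that $(w_n)_n$ is a Cauchy sequence in $W$: for if $\varepsilon > 0$, we may choose $N$ such that $\|v_0 - w_n\|^2 - \delta^2 < \varepsilon$ whenever $n \ge N$. By the Parallelogram Law (and the fact that $\tfrac12(w_n + w_m) \in W$) it follows that if $n, m \ge N$, then

$$2\varepsilon + 2\delta^2 > \|v_0 - w_n\|^2 + \|v_0 - w_m\|^2 = 2\|v_0 - \tfrac12(w_n + w_m)\|^2 + 2\|\tfrac12(w_n - w_m)\|^2 \ge 2\delta^2 + \tfrac12\|w_n - w_m\|^2$$

so that $\|w_n - w_m\|^2 < 4\varepsilon$.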






Since $(w_n)_n$ is a Cauchy sequence, and since $W$ is closed, there is $w_0 \in W$ such that $w_n \to w_0$. We will show that $w_0 = v_0^{\parallel}$. The fact that $\|v_0 - w_0\| \le \|v_0 - w_n\| + \|w_n - w_0\|$ (for all $n \in \mathbb{N}$) is then easily seen to imply that $\|v_0 - w_0\| = \delta$.

It remains to show that $v_0 - w_0 \perp W$. Given an arbitrary $w \in W$ and a real $\lambda \in \mathbb{R}$, we have $\|v_0 - w_0\|^2 = \delta^2 \le \|v_0 - (w_0 + \lambda w)\|^2$, so that

$$-2\lambda\langle v_0 - w_0, w\rangle + \lambda^2\|w\|^2 \ge 0$$

Since this holds for all real $\lambda$ we must have $\langle v_0 - w_0, w\rangle = 0$. (Another way to see this is to note that the quadratic in $\lambda$ has a unique root at $\lambda = 0$, and to calculate the discriminant.)

⊣
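
In finite dimensions the decomposition $v_0 = v_0^{\parallel} + v_0^{\perp}$ can be computed directly. The sketch below assumes $V = \mathbb{R}^3$ with the dot product and uses Gram-Schmidt orthogonalisation, which is one standard way to obtain the projection and is not the method of the proof above; the helper names `axpy` and `project` are hypothetical.

```python
from fractions import Fraction as F

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(a, x, y):
    """Return the vector a*x + y."""
    return tuple(a * s + t for s, t in zip(x, y))

def project(v, spanning):
    """Orthogonal projection of v onto span(spanning), via Gram-Schmidt."""
    basis = []
    for u in spanning:
        for b in basis:                       # remove components along earlier vectors
            u = axpy(-inner(u, b) / inner(b, b), b, u)
        if inner(u, u) != 0:                  # skip linearly dependent vectors
            basis.append(u)
    par = tuple(F(0) for _ in v)
    for b in basis:                           # sum of projections onto orthogonal basis
        par = axpy(inner(v, b) / inner(b, b), b, par)
    return par

W = [(F(1), F(1), F(0)), (F(0), F(1), F(1))]  # spanning set of the subspace W
v = (F(1), F(2), F(4))

v_par = project(v, W)
v_perp = axpy(F(-1), v_par, v)                # v - v_par
print(v_par, v_perp, [inner(v_perp, w) for w in W])
```

The residual `v_perp` is orthogonal to every spanning vector of `W`, which is property (iii) above.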

6.2 $L^p$ Spaces

Definition 6.2.1 Suppose that $(S, \mathcal{F}, \mu)$ is a measure space. If $1 \le p < \infty$ ($p$ need not be an integer), then $\mathcal{L}^p(S, \mathcal{F}, \mu)$ is defined to be the set of all $\mathcal{F}$-measurable functions $f : S \to \mathbb{R}$ such that $|f|^p$ is $\mu$-integrable.

A function $f : S \to \mathbb{R}$ is said to be essentially bounded iff there is a real number $M$ such that $|f| \le M$ $\mu$-a.e.

$\mathcal{L}^\infty(S, \mathcal{F}, \mu)$ is defined to be the set of all essentially bounded $\mathcal{F}$-measurable $f : S \to \mathbb{R}$.

For each $1 \le p < \infty$, we define a map $\|\cdot\|_p : \mathcal{L}^p(S, \mathcal{F}, \mu) \to \mathbb{R}$ by

$$\|f\|_p := \mu(|f|^p)^{1/p} = \Big(\int |f|^p\,d\mu\Big)^{1/p}$$

We also define a map $\|\cdot\|_\infty : \mathcal{L}^\infty(S, \mathcal{F}, \mu) \to \mathbb{R}$ by

$$\|f\|_\infty = \inf\{M : |f| \le M\ \mu\text{-a.e.}\}$$

The maps $\|\cdot\|_p$ are called $L^p$-norms, or just $p$-norms.
If the underlying measure space is understood from context, we shall write $\mathcal{L}^p$ instead of $\mathcal{L}^p(S, \mathcal{F}, \mu)$.

Remarks 6.2.2 • Note that $\mathcal{L}^1$ is just the family of $\mu$-integrable functions (cf. Propn. 4.1.6).

• Also, if $f : S \to \mathbb{R}$ is $\mathcal{F}$-measurable, then $f \in \mathcal{L}^p$ iff $|f|^p \in \mathcal{L}^1$ iff $\|f\|_p < \infty$.

• Finally, if $f \in \mathcal{L}^\infty$, then $|f| \le \|f\|_\infty$ $\mu$-a.e.:

$$\{s \in S : |f(s)| > \|f\|_\infty\} = \bigcup_n \{s \in S : |f(s)| > \|f\|_\infty + \tfrac1n\}$$

is a countable union of $\mu$-null sets.

□ 






We shall show that the $\mathcal{L}^p$ spaces are almost Banach spaces, and that $\mathcal{L}^2$ is almost a Hilbert space.

Lemma 6.2.3 The $\mathcal{L}^p$ spaces are vector spaces.

Proof: If $1 \le p < \infty$, then $|f + g|^p \le (|f| + |g|)^p \le \max\{(2|f|)^p, (2|g|)^p\} \le 2^p(|f|^p + |g|^p)$. Hence if $f, g \in \mathcal{L}^p$, then $\int |f + g|^p\,d\mu \le 2^p\big(\int |f|^p\,d\mu + \int |g|^p\,d\mu\big) < \infty$.

⊣

For the next theorem, note that if $a, b \ge 0$, and if $1 < p, q < \infty$ are such that $p^{-1} + q^{-1} = 1$, then

$$ab \le \frac{a^p}{p} + \frac{b^q}{q}$$

To see this, define $h(t) = tb - \frac{t^p}{p}$ and find the maximum of $h$. Alternatively, apply the Arithmetic-Geometric Mean inequality.



Theorem 6.2.4 Let $(S, \mathcal{F}, \mu)$ be a measure space and let $f, g$ be real-valued $\mathcal{F}$-measurable functions.

(a) (Hölder's Inequality) Suppose that $1 \le p \le \infty$ and that $p^{-1} + q^{-1} = 1$. If $f \in \mathcal{L}^p$, and $g \in \mathcal{L}^q$, then $fg \in \mathcal{L}^1$ and

$$\|fg\|_1 \le \|f\|_p\,\|g\|_q$$

(b) (Minkowski's Inequality) Let $p \ge 1$. If $f, g \in \mathcal{L}^p$, then

$$\|f + g\|_p \le \|f\|_p + \|g\|_p$$



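Proof: (a) If $p = 1$ (so that $q = +\infty$), then $|fg| \le |f|\,\|g\|_\infty$ $\mu$-a.e. and so $\|fg\|_1 = \mu|fg| \le \mu|f| \cdot \|g\|_\infty = \|f\|_1\,\|g\|_\infty < \infty$.

If $p > 1$, put $a = \frac{|f(s)|}{\|f\|_p}$, $b = \frac{|g(s)|}{\|g\|_q}$ and apply the remark just before the statement of the theorem to conclude

$$\frac{|f(s)g(s)|}{\|f\|_p\,\|g\|_q} \le \frac{|f(s)|^p}{p\,\|f\|_p^p} + \frac{|g(s)|^q}{q\,\|g\|_q^q}$$

Integrating both sides w.r.t. $\mu$ yields the result, because the righthand side integrates to $\frac1p + \frac1q = 1$.

(b) This relation is easy to prove if $p = 1$ or $p = \infty$. For $1 < p < \infty$, note that $q = \frac{p}{p-1}$ and thus that $|f + g|^{p-1} \in \mathcal{L}^q$. By Hölder's inequality,

$$\begin{aligned}
\|f + g\|_p^p &\le \int |f|\,|f + g|^{p-1}\,d\mu + \int |g|\,|f + g|^{p-1}\,d\mu \\
&\le \|f\|_p \cdot \big\||f + g|^{p-1}\big\|_q + \|g\|_p \cdot \big\||f + g|^{p-1}\big\|_q \\
&= (\|f\|_p + \|g\|_p)\,\|f + g\|_p^{p-1}
\end{aligned}$$

Dividing by $\|f + g\|_p^{p-1}$ (if it is non-zero) yields the result.

⊣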
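
Both inequalities are easy to test numerically on a finite measure space, where integrals are weighted sums. The following sketch assumes a four-point space with arbitrary illustrative weights `mu` and functions `f`, `g`; it checks one instance and proves nothing.

```python
def lp_norm(f, mu, p):
    """p-norm of f w.r.t. the measure with point weights mu (1 <= p < inf)."""
    return sum(abs(a) ** p * m for a, m in zip(f, mu)) ** (1.0 / p)

mu = [0.5, 1.0, 2.0, 0.25]        # weights of a measure on a 4-point space
f = [1.0, -2.0, 0.5, 3.0]
g = [2.0, 1.0, -1.0, 0.0]
p = 3.0
q = p / (p - 1.0)                 # conjugate exponent: 1/p + 1/q = 1

# Hoelder: ||fg||_1 <= ||f||_p ||g||_q
fg = [a * b for a, b in zip(f, g)]
holder_lhs = lp_norm(fg, mu, 1.0)
holder_rhs = lp_norm(f, mu, p) * lp_norm(g, mu, q)

# Minkowski: ||f + g||_p <= ||f||_p + ||g||_p
f_plus_g = [a + b for a, b in zip(f, g)]
mink_lhs = lp_norm(f_plus_g, mu, p)
mink_rhs = lp_norm(f, mu, p) + lp_norm(g, mu, p)

print(holder_lhs, holder_rhs, mink_lhs, mink_rhs)
```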

It is now clear that $\|\cdot\|_p$ satisfies the following:

(i) $\|f\|_p \ge 0$, and $\|f\|_p = 0$ iff $f = 0$ $\mu$-a.e.






(ii) If $a \in \mathbb{R}$, then $\|af\|_p = |a|\,\|f\|_p$.

(iii) $\|f + g\|_p \le \|f\|_p + \|g\|_p$.

Thus $\|\cdot\|_p$ is almost a norm on $\mathcal{L}^p$. The requirement that $\|f\|_p = 0$ iff $f = 0$ doesn't hold exactly; it holds only almost everywhere. To get a bona fide norm, we must identify any two functions that are equal $\mu$-a.e.:

Definition and Proposition 6.2.5 Let $(S, \mathcal{F}, \mu)$ be a measure space, and let $1 \le p \le \infty$. Define a relation $\approx$ on $\mathcal{L}^p$ by $f \approx g$ iff $f = g$ $\mu$-a.e. Then $\approx$ is an equivalence relation on $\mathcal{L}^p$. Define $[f] := \{g \in \mathcal{L}^p : f \approx g\}$. Then $[0] = \{g \in \mathcal{L}^p : g = 0\ \mu\text{-a.e.}\}$ is a vector subspace of $\mathcal{L}^p$, and $[f] = f + [0] := \{f + g : g \in [0]\}$. Let

$$L^p(S, \mathcal{F}, \mu) = \{[f] : f \in \mathcal{L}^p(S, \mathcal{F}, \mu)\}$$

Then $L^p$ is a vector space and the map, which by abuse of notation we also call $\|\cdot\|_p$, which is defined by

$$\|[f]\|_p := \|f\|_p$$

is a norm on $L^p$.



Proof: That $\approx$ is an equivalence relation is straightforward, as is the statement that $[0]$ is a vector subspace of $\mathcal{L}^p$. It is also easy to see that $L^p$ is a vector space, if the operations are defined in the natural way (e.g. $[f] + [g] := [f + g]$; one must check that this is well-defined, i.e. that if $[f_1] = [f_2]$ and $[g_1] = [g_2]$, then $[f_1 + g_1] = [f_2 + g_2]$, but that is easy). $[0]$ is clearly the zero vector in $L^p$. Also, if $[f_1] = [f_2]$, then $f_1 = f_2$ $\mu$-a.e., and thus $\mu|f_1|^p = \mu|f_2|^p$, which shows that $\|f_1\|_p = \|f_2\|_p$ (in case $p < \infty$), and thus that $\|\cdot\|_p$ is well-defined on $L^p$. To see that it is a norm, note that (i) $\|[f]\|_p = \|f\|_p \ge 0$, and that $\|[f]\|_p = 0$ iff $f = 0$ $\mu$-a.e. iff $[f] = [0]$; (ii) $\|a[f]\|_p = \|af\|_p = |a|\,\|[f]\|_p$, and (iii) $\|[f] + [g]\|_p = \|f + g\|_p \le \|[f]\|_p + \|[g]\|_p$.
In case $p = \infty$, it is also straightforward to see that $\|\cdot\|_\infty$ is a well-defined norm on $L^\infty$.

⊣

In practice, we usually don't bother too much about the distinction between C p and L p . 
Now that we know that || • || p is a norm, we have a notion of convergence: 

Definition 6.2.6 A sequence $(f_n)_n$ in $\mathcal{L}^p(S, \mathcal{F}, \mu)$ is said to converge to $f \in \mathcal{L}^p(S, \mathcal{F}, \mu)$ in $L^p$ (or in $p^{\text{th}}$ mean) iff $\|f_n - f\|_p \to 0$.

We now have two notions of convergence for measurable functions: almost everywhere convergence, and convergence in mean. We write

$$f_n \xrightarrow{\text{a.e.}} f \qquad f_n \xrightarrow{L^p} f$$



Theorem 6.2.7 (Riesz-Fischer)

If $(S, \mathcal{F}, \mu)$ is a measure space and $1 \le p \le \infty$, then $L^p(S, \mathcal{F}, \mu)$ is a Banach space.

Proof: Suppose that $(f_n)_n$ is a Cauchy sequence in $\mathcal{L}^p$, i.e. that $\sup_{m \ge n} \|f_m - f_n\|_p \to 0$ as $n \to \infty$.






First assume that $1 \le p < \infty$. Choose an increasing sequence $(n_k)_k$ of natural numbers such that $\sup_{m \ge n_k} \|f_m - f_{n_k}\|_p < 2^{-k}$ for each $k \in \mathbb{N}$. Then by the MCT $\big\|\sum_k |f_{n_{k+1}} - f_{n_k}|\big\|_p \le \sum_k \|f_{n_{k+1}} - f_{n_k}\|_p < \infty$. Hence $\sum_k |f_{n_{k+1}} - f_{n_k}| < \infty$ $\mu$-a.e., and hence $(f_{n_k})_k$ is a Cauchy sequence $\mu$-a.e. Define $f : S \to \mathbb{R}$ by $f(s) = \lim_k f_{n_k}(s)$, if this limit exists, and $f(s) = 0$ else. Then $f$ is measurable, and $f_{n_k} \to f$ $\mu$-a.e. as $k \to \infty$. Then by Fatou's Lemma,

$$\|f\|_p \le \liminf_n \|f_n\|_p < \infty$$

(because Cauchy sequences are bounded), so that $f \in \mathcal{L}^p$, and similarly

$$\|f - f_n\|_p \le \liminf_k \|f_{n_k} - f_n\|_p \le \sup_{m \ge n} \|f_m - f_n\|_p \to 0 \quad\text{as } n \to \infty$$

Thus $f_n \xrightarrow{L^p} f$.

Next, assume that $p = \infty$. We have $\sup_{m \ge n} |f_m - f_n| \le \sup_{m \ge n} \|f_m - f_n\|_\infty$ $\mu$-a.e., and thus $(f_n)_n$ is a Cauchy sequence $\mu$-a.e. Define $f : S \to \mathbb{R}$ as above: $f(s) = \lim_n f_n(s)$ if this limit exists, and $f(s) = 0$ otherwise. Then

$$|f| \le |f_n| + |f_n - f| = |f_n| + \lim_m |f_n - f_m| \le \|f_n\|_\infty + \sup_{m \ge n} \|f_n - f_m\|_\infty \quad \mu\text{-a.e.}$$

so that $f \in \mathcal{L}^\infty$ and

$$|f_n - f| = \lim_k |f_n - f_k| \le \sup_{m \ge n} \|f_n - f_m\|_\infty \quad \mu\text{-a.e.}$$

Hence $\|f_n - f\|_\infty \le \sup_{m \ge n} \|f_n - f_m\|_\infty \to 0$ as $n \to \infty$, proving that $f_n \xrightarrow{L^\infty} f$.

⊣

The following result is easy:

Theorem 6.2.8 Let $(S, \mathcal{F}, \mu)$ be a measure space. The map

$$\langle\cdot,\cdot\rangle : \mathcal{L}^2 \times \mathcal{L}^2 \to \mathbb{R} : (f, g) \mapsto \int fg\,d\mu$$

is an inner product on $\mathcal{L}^2$ which induces the $L^2$-norm $\|\cdot\|_2$. Hence $L^2(S, \mathcal{F}, \mu)$ is a Hilbert space.

Proof: Suppose that $f, g \in \mathcal{L}^2(S, \mathcal{F}, \mu)$. By Hölder's inequality, we have $\|fg\|_1 \le \|f\|_2\,\|g\|_2$, so $fg$ is integrable. It is now easy to see that $\langle f, g\rangle := \int fg\,d\mu$ defines an inner product on $L^2$ (where we identify functions that are $\mu$-a.e. equal to ensure $\langle f, f\rangle = 0$ implies $f = 0$). Furthermore, the norm induced by this inner product is precisely

$$\|f\| = \langle f, f\rangle^{1/2} = \Big(\int |f|^2\,d\mu\Big)^{1/2} = \|f\|_2$$

Thus $L^2(S, \mathcal{F}, \mu)$ is an inner product space. Since it is also a Banach space by the Riesz-Fischer Theorem, it is a complete inner product space, i.e. a Hilbert space.






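6.3 $L^p$-Spaces and Probability

Let us investigate the $L^p$-spaces in the context of a probability space.

Definition 6.3.1 (i) If $X$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$, then its $p^{\text{th}}$ moment is defined to be $\mathbb{E}X^p$ (which exists iff $X \in \mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$).

(ii) The variance of a random variable $X \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is defined by $\operatorname{Var}(X) := \mathbb{E}(X - \mathbb{E}X)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2$. The first equality shows that $\operatorname{Var}(X) \ge 0$.

(iii) The covariance of two random variables $X, Y \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is defined by $\operatorname{Cov}(X, Y) := \mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y) = \mathbb{E}XY - (\mathbb{E}X)(\mathbb{E}Y)$.

(iv) The standard deviation of $X \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is given by $\sigma_X := \operatorname{Var}(X)^{1/2}$.

(v) The correlation of two random variables $X, Y \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is defined by

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$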



Note the polarization identity

$$\operatorname{Cov}(X, Y) = \tfrac14\big(\operatorname{Var}(X + Y) - \operatorname{Var}(X - Y)\big)$$

We know that for $1 \le p \le \infty$, the space $L^p(\Omega, \mathcal{F}, \mathbb{P})$ is a Banach space. If $1 \le p < \infty$, then $\mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$ consists of all those random variables that have $p^{\text{th}}$ moments. The norm on $\mathcal{L}^p$ is simply the $p^{\text{th}}$ root of the $p^{\text{th}}$ absolute moment, i.e.

$$\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$$

If $p = \infty$, then $\mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$ consists of all essentially bounded random variables. Hölder's inequality is simply the statement

$$\mathbb{E}|XY| \le (\mathbb{E}|X|^p)^{1/p}\,(\mathbb{E}|Y|^q)^{1/q}$$

where $X \in \mathcal{L}^p$, $Y \in \mathcal{L}^q$ and $\frac1p + \frac1q = 1$.

$\mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is a Hilbert space, with inner product $\langle X, Y\rangle := \mathbb{E}XY$. This has a nice interpretation if we consider only the space $\mathcal{L}^2_0(\Omega, \mathcal{F}, \mathbb{P})$ of centered random variables, i.e. $\mathcal{L}^2$-variables with zero mean. For then $\mathbb{E}XY = \operatorname{Cov}(X, Y)$, i.e. the covariance is the inner product (when restricted to $\mathcal{L}^2_0$). Similarly, the standard deviation of $X$ is simply the norm (length) of $X$: $\sigma_X = \|X\|_2$. Finally, the correlation is given by

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\langle X, Y\rangle}{\|X\|_2\,\|Y\|_2} = \cos\theta$$

i.e. the correlation between two centered random variables can be interpreted as the cosine of the angle between them in $\mathcal{L}^2$. In particular, two centered random variables are uncorrelated iff they are orthogonal.

The Cauchy-Schwarz inequality is simply the statement that $|\mathbb{E}(XY)| \le \|X\|_2\,\|Y\|_2$, with equality only if $Y$ is a scalar multiple of $X$. If $X, Y$ have zero mean, this amounts to saying $|\operatorname{Cov}(X, Y)| \le \sigma_X \sigma_Y$, i.e. $-1 \le \rho_{X,Y} \le 1$, with equality only if $Y$ is a scalar multiple of $X$. Since $\rho_{X,Y} = \cos\theta$, this is not surprising.
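
The geometric reading of correlation can be made concrete on a finite probability space. The sketch below assumes a fair die with $X(\omega) = \omega$ and $Y(\omega) = \omega^2$ (illustrative choices) and compares the correlation with the cosine of the angle between the centered variables.

```python
import math
from fractions import Fraction as F

omega = range(1, 7)          # outcomes of a fair die
p = F(1, 6)

def E(f):
    """Expectation under the uniform measure."""
    return sum(f(w) * p for w in omega)

def centered(f):
    """The centered variable f - E[f]."""
    m = E(f)
    return lambda w: f(w) - m

def inner(f, g):
    """Inner product <f, g> = E[fg] on L^2 of the die space."""
    return E(lambda w: f(w) * g(w))

X = lambda w: w
Y = lambda w: w * w
X0, Y0 = centered(X), centered(Y)

cov = inner(X, Y) - E(X) * E(Y)
sigma_x = math.sqrt(inner(X0, X0))     # standard deviation = norm of X0
sigma_y = math.sqrt(inner(Y0, Y0))
rho = float(cov) / (sigma_x * sigma_y)
cos_theta = float(inner(X0, Y0)) / (sigma_x * sigma_y)

print(cov, rho, cos_theta)
```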

For probability theory, the following result is also useful: 






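Proposition 6.3.2 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. If $1 \le p \le r \le \infty$, then

$$\|X\|_p \le \|X\|_r$$

for any random variable $X$, so that $\mathcal{L}^r(\Omega, \mathcal{F}, \mathbb{P}) \subseteq \mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$.
Moreover, if $X \in \mathcal{L}^\infty$, then

$$\|X\|_\infty = \lim_{p \to \infty} \|X\|_p$$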

p— >oo 



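Proof: Note that if $X \in \mathcal{L}^r$, then $|X|^p \in \mathcal{L}^{r/p}$. Now $p' = \frac{r}{p}$ and $q' = \frac{r}{r - p}$ satisfy the relation $\frac{1}{p'} + \frac{1}{q'} = 1$, and so Hölder's inequality applied to $f = |X|^p$ and $g = 1$ yields

$$\|X\|_p^p = \int fg\,d\mathbb{P} \le \|f\|_{p'} \cdot \|g\|_{q'} = \Big(\int |X^p|^{r/p}\,d\mathbb{P}\Big)^{p/r} \cdot 1 = \|X\|_r^p$$

Next, suppose that $X \in \mathcal{L}^\infty$. Then $\|X\|_p \le \|X\|_\infty$, so $\limsup_p \|X\|_p \le \|X\|_\infty$.

If $M < \|X\|_\infty$, then $\int |X|^p\,d\mathbb{P} \ge M^p\,\mathbb{P}(|X| > M)$ and so $\|X\|_p \ge M\,\mathbb{P}(|X| > M)^{1/p}$. Now $\mathbb{P}(|X| > M) > 0$, because $M < \|X\|_\infty$, and thus $\liminf_p \|X\|_p \ge M$ (because $\mathbb{P}(|X| > M)^{1/p} \to 1$). Since $M$ was arbitrary, also $\liminf_p \|X\|_p \ge \|X\|_\infty$.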



Propn. 6.3.2 states that if $1 \le p \le r < \infty$, and if the $r^{\text{th}}$ absolute moment of $X$ exists, then so does the $p^{\text{th}}$ absolute moment.
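
Both statements of Propn. 6.3.2 can be observed numerically. The following sketch assumes a random variable taking three values with the illustrative probabilities below; it shows the $p$-norms increasing in $p$ and approaching the essential supremum.

```python
probs = [0.2, 0.5, 0.3]        # a probability measure on a 3-point space
values = [1.0, -3.0, 2.0]      # values of X on the three points

def norm_p(p):
    """||X||_p = (E|X|^p)^(1/p)."""
    return sum(abs(v) ** p * w for v, w in zip(values, probs)) ** (1.0 / p)

ps = [1, 2, 4, 8, 16, 64, 256]
norms = [norm_p(p) for p in ps]

# Essential supremum: largest |value| carried by a point of positive probability.
sup_norm = max(abs(v) for v, w in zip(values, probs) if w > 0)

for p, n in zip(ps, norms):
    print(p, n)
print("sup norm:", sup_norm)
```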



Chapter 7 

Conditional Expectation 



7.1 Definition of Conditional Expectation 

The concept of conditional expectation, in its modern form due to Kolmogorov, is quite 
abstract, but it is one of the most important concepts in probability theory. We therefore 
begin with some examples to familiarize ourselves with the idea of conditioning, and to provide 
some motivation. 



7.1.1 Conditioning on an Event 



In § 3.1, we introduced the notion of the conditional probability $\mathbb{P}(B|A)$ that $B$ occurs, given that $A$ has occurred:

$$\mathbb{P}(B|A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}$$

Suppose that $X$ is a random variable and that $A$ is an event with positive probability. Before we know that $A$ has occurred, we have a best estimate of what $X$ will be, namely $\mathbb{E}X$. If we are told that $A$ has occurred, however, we will revise our expectation. For example, the expected value if a fair die is rolled is 3.5. However, if we are told that the outcome is even, we revise our estimate: the expected value is now 4. If we are told that the outcome is odd, the expected value is 3. Essentially, when we are told that an event $A$ has occurred, we revise our probability measure: Each outcome outside $A$ is now assigned measure 0, whereas each event inside $A$ has its probability scaled up by dividing by $\mathbb{P}(A)$: $\mathbb{P}(B|A) = \frac{\mathbb{P}(B)}{\mathbb{P}(A)}$ for all $B \subseteq A$. We use the new measure $\mathbb{P}(\cdot\,|A)$ on the sample space $A$ to calculate our revised expectation: The expectation of $X$ given $A$ is

$$\mathbb{E}[X|A] = \frac{\mathbb{E}[X; A]}{\mathbb{P}(A)}$$
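
The die example can be reproduced directly from the formula $\mathbb{E}[X|A] = \mathbb{E}[X;A]/\mathbb{P}(A)$. A minimal sketch, assuming the fair-die model; the function name `cond_exp_event` is a hypothetical helper.

```python
from fractions import Fraction as F

omega = range(1, 7)
P = {w: F(1, 6) for w in omega}      # fair die

def cond_exp_event(X, A):
    """E[X | A] = E[X; A] / P(A) for an event A with P(A) > 0."""
    pA = sum(P[w] for w in A)
    return sum(X(w) * P[w] for w in A) / pA

X = lambda w: w                      # the outcome itself
given_even = cond_exp_event(X, {2, 4, 6})
given_odd = cond_exp_event(X, {1, 3, 5})
unconditional = cond_exp_event(X, set(omega))

print(given_even, given_odd, unconditional)
```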



7.1.2 Conditioning on a $\sigma$-Algebra in a Discrete Probability Space

Suppose that we are working with a discrete probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega = \{\omega_n : n = 1, 2, \dots\}$, and each $\mathbb{P}\{\omega_n\} > 0$. If $\mathcal{G}$ is a $\sigma$-subalgebra of $\mathcal{F}$, then there is a partition $B_1, B_2, \dots$ which generates $\mathcal{G}$, i.e. such that $\mathcal{G} = \sigma(B_1, B_2, \dots)$. In the previous subsection,





we defined $\mathbb{E}[X|G]$ for any non-empty $G \in \mathcal{G}$. In particular, $c_n := \mathbb{E}[X|B_n]$ has been defined for each block $B_n$.

Consider now a random variable $Z$ defined by

$$Z(\omega) = c_n \quad\text{when } \omega \in B_n$$

i.e. $Z$ is a random variable which takes the constant value $c_n$ on the block $B_n$:

$$Z = \sum_n c_n I_{B_n}$$

It is then clear that the random variable $Z$ is measurable with respect to the $\sigma$-algebra $\mathcal{G}$.

$Z$ is clearly an approximation of $X$. In fact, $Z$ is the best approximation of $X$ which is measurable with respect to the $\sigma$-algebra $\mathcal{G}$, where "best" is meant in a sense which we shall make fully precise later.

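Example 7.1.1 A fair die is rolled, and you get an amount equal to the outcome $\omega$, but only if it is even. Let $X$ be your winnings, i.e. $X(\omega) = \omega I_{\{2,4,6\}}(\omega)$. Let $\mathcal{G} = \sigma(\{2,4,6\}, \{1,3,5\})$. We calculate the random variable $\mathbb{E}[X|\mathcal{G}]$:

$$\mathbb{E}[X|\mathcal{G}](\omega) = \begin{cases}
\mathbb{E}[X|\{2,4,6\}] = \dfrac{\mathbb{E}[X; \{2,4,6\}]}{\mathbb{P}(\{2,4,6\})} = \dfrac{\frac16(2 + 4 + 6)}{\frac12} = 4 & \text{if } \omega \text{ is even} \\[2ex]
\mathbb{E}[X|\{1,3,5\}] = \dfrac{\mathbb{E}[X; \{1,3,5\}]}{\mathbb{P}(\{1,3,5\})} = 0 & \text{if } \omega \text{ is odd}
\end{cases}$$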



□ 



Since $Z := \mathbb{E}[X|\mathcal{G}]$ is a random variable, we can try to integrate it. If $Z(\omega) = c_n$ when $\omega \in B_n$, i.e. if $c_n = \mathbb{E}[X|B_n]$, then

$$\begin{aligned}
\mathbb{E}[Z; B_n] &= c_n \cdot \mathbb{E}[I_{B_n}] \quad\text{because } Z \text{ has value } c_n \text{ on } B_n \\
&= \mathbb{E}[X|B_n]\,\mathbb{P}(B_n) \\
&= \frac{\mathbb{E}[X; B_n]}{\mathbb{P}(B_n)} \cdot \mathbb{P}(B_n) \\
&= \mathbb{E}[X; B_n]
\end{aligned}$$

Arbitrary $\sigma$-algebras on arbitrary probability spaces need not be generated by countable partitions, i.e. there may be no "blocks". However, we can use an "averaging" property to make the above calculation independent of a generating partition. Since $\mathbb{E}[X; A] = \int_A X\,d\mathbb{P}$ is the integral (a kind of "average") of the random variable $X$ over the set $A$, it follows that $X$ and $Z = \mathbb{E}[X|\mathcal{G}]$ have the same integrals ("averages") on each block. Now each $G \in \mathcal{G}$ is a union of blocks $B_n$, i.e. $G = \bigcup_k B_{n_k}$ for some sequence $B_{n_k}$. Thus, assuming the integrals exist,

$$\int_G X\,d\mathbb{P} = \sum_k \int_{B_{n_k}} X\,d\mathbb{P} = \sum_k \int_{B_{n_k}} Z\,d\mathbb{P} = \int_G Z\,d\mathbb{P}$$

i.e. $\mathbb{E}[Z; G] = \mathbb{E}[X; G]$ for all $G \in \mathcal{G}$.
We have shown:
We have shown: 






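Theorem 7.1.2 Let $X$ be an integrable random variable on a discrete probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and let $\mathcal{G}$ be a $\sigma$-subalgebra of $\mathcal{F}$. Then the conditional expectation $\mathbb{E}[X|\mathcal{G}]$ is a random variable with the properties that

(a) $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$-measurable.

(b) $\int_G \mathbb{E}[X|\mathcal{G}]\,d\mathbb{P} = \int_G X\,d\mathbb{P}$ for all $G \in \mathcal{G}$.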
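
On a finite space both properties can be verified by enumeration. The sketch below assumes the setting of Example 7.1.1 (fair die, winnings paid only on even outcomes); `cond_exp` and `integral` are hypothetical helper names. Property (a) holds by construction, since `Z` is constant on each block, and property (b) is checked on every union of blocks.

```python
from fractions import Fraction as F
from itertools import combinations

omega = list(range(1, 7))
P = {w: F(1, 6) for w in omega}      # fair die

def cond_exp(X, blocks):
    """E[X | G] for G = sigma(blocks): the constant E[X; B]/P(B) on each block B."""
    Z = {}
    for B in blocks:
        c = sum(X(w) * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            Z[w] = c
    return Z

def integral(f, G):
    """Integral of f over the event G, i.e. E[f; G]."""
    return sum(f(w) * P[w] for w in G)

X = lambda w: w if w % 2 == 0 else 0      # winnings of Example 7.1.1
blocks = [{2, 4, 6}, {1, 3, 5}]
Z = cond_exp(X, blocks)

# Every event in sigma(blocks) is a union of blocks (including the empty union).
events = [set().union(*c)
          for r in range(len(blocks) + 1)
          for c in combinations(blocks, r)]
averaging_ok = all(integral(X, G) == integral(lambda w: Z[w], G) for G in events)

print(Z, averaging_ok)
```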



7.1.3 Conditioning on a $\sigma$-Algebra in a General Probability Space

To define $\mathbb{E}[X|\mathcal{G}]$ on a general probability space, not necessarily discrete, we turn things upside down, and make Theorem 7.1.2 into a definition:

Definition and Theorem 7.1.3 (Kolmogorov)

Suppose that $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and that $X$ is a random variable in $\mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Then there exists a random variable $Z$ such that

(i) $Z$ is $\mathcal{G}$-measurable.

(ii) $Z \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$, i.e. $\mathbb{E}(|Z|) < +\infty$.

(iii) For every set $G \in \mathcal{G}$, we have

$$\mathbb{E}[Z; G] = \mathbb{E}[X; G] \quad\text{i.e.}\quad \int_G Z\,d\mathbb{P} = \int_G X\,d\mathbb{P}$$

Moreover, if $Z'$ is a random variable satisfying (i),(ii),(iii), then $Z = Z'$ a.s. Any random variable $Z$ with the properties (i),(ii),(iii) is called a version of the conditional expectation of $X$ given $\mathcal{G}$. We write

$$Z = \mathbb{E}[X|\mathcal{G}] \text{ a.s.} \quad\text{or}\quad Z = \mathbb{E}_{\mathcal{G}} X \text{ a.s.}$$

Definition 7.1.4 We define conditional expectation w.r.t. general random variables in the following manner:

$$\mathbb{E}[X|Y] := \mathbb{E}[X|\sigma(Y)]$$

To prove that conditional expectations exist, we give a geometric argument, involving approximation in a Hilbert space.

Before we start the proof, recall that a Hilbert space $V$ is a vector space which is equipped with an inner product, which we will denote $\langle v_1, v_2\rangle$. Now an inner product automatically induces a norm (length), and angle

$$\|v\| = \langle v, v\rangle^{1/2} \qquad \cos\theta = \frac{\langle v_1, v_2\rangle}{\|v_1\|\,\|v_2\|}$$

Here $\theta$ is the angle between $v_1$ and $v_2$. We say that $v_1, v_2$ are orthogonal if $\langle v_1, v_2\rangle = 0$. Hilbert spaces are also complete, i.e. every Cauchy sequence in $V$ converges (to a vector in $V$).






Suppose that $W$ is a complete subspace of $V$. We then have the notion of orthogonal projection onto $W$. Given any vector $v \in V$, there exists a unique decomposition

$$v = v^{\parallel} + v^{\perp}$$

with the following properties:

(1) $v^{\parallel} \in W$.

(2) $v^{\perp} \perp W$, i.e. $\langle v^{\perp}, w\rangle = 0$ for all $w \in W$.

(3) $\|v - v^{\parallel}\| = \inf\{\|v - w\| : w \in W\}$.

Thus $v^{\parallel}$ is the vector in $W$ which is the best approximation of $v$: It lies closer to $v$ than any other $w \in W$. $v^{\parallel}$ is called the orthogonal projection of $v$ onto $W$.

Recall also that $\mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ is a Hilbert space, with inner product $\langle X, Y\rangle = \mathbb{E}XY$ and induced norm $\|X\|_2 = (\mathbb{E}X^2)^{1/2}$. (We ignore here the small difference between $L^2$ and $\mathcal{L}^2$.)



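Proof of Thm. 7.1.3: First assume that $X \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$. Note that $\mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})$ is a closed subspace of $\mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$, and thus there exists a decomposition

$$X = Z + Y \quad\text{where } Z \in \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P}) \text{ and } Y \perp \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})$$

Moreover, $\|X - Z\|_2 = \inf\{\|X - U\|_2 : U \in \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})\}$. Now $Z$ is clearly $\mathcal{G}$-measurable. Also, if $G \in \mathcal{G}$, then $I_G \in \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})$, and so $Y \perp I_G$. Hence

$$\mathbb{E}[Z; G] = \langle Z, I_G\rangle = \langle X, I_G\rangle = \mathbb{E}[X; G] \quad\text{for all } G \in \mathcal{G}$$

It follows that $Z = \mathbb{E}[X|\mathcal{G}]$ a.s.

For $X \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$, we use an approximation argument. First assume that $X \ge 0$, and for $n \in \mathbb{N}$, define $X_n := X \wedge n$. Then $X_n \uparrow X$, and each $X_n \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$. By the above, there are $Z_n \in \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})$ such that $Z_n = \mathbb{E}[X_n|\mathcal{G}]$ a.s. Next, note that if $n < m$, then $X_n \le X_m$, and thus $Z_n \le Z_m$ a.s.: For if $\varepsilon > 0$, and $G_\varepsilon := \{Z_n - Z_m > \varepsilon\}$, then $G_\varepsilon \in \mathcal{G}$, so that $0 \le \varepsilon \cdot \mathbb{P}(G_\varepsilon) \le \mathbb{E}[Z_n - Z_m; G_\varepsilon] = \mathbb{E}[X_n - X_m; G_\varepsilon] \le 0$. Hence $\mathbb{P}(G_\varepsilon) = 0$ for all $\varepsilon > 0$. Now $\{Z_n > Z_m\} = \bigcup_{k \in \mathbb{N}} G_{1/k}$, and hence $\mathbb{P}(Z_n > Z_m) = 0$, i.e. $Z_n \le Z_m$ a.s., for each pair $n < m$. Taking the (countable) intersection over all such pairs yields $\mathbb{P}((Z_n)_n \text{ is an increasing sequence}) = 1$, i.e. the sequence $(Z_n)_n$ is increasing a.s. Define $Z = \limsup_n Z_n$. Then $Z$ is $\mathcal{G}$-measurable, and $Z_n \uparrow Z$ a.s. If $G \in \mathcal{G}$, then by two applications of the MCT we have

$$\mathbb{E}[Z; G] = \lim_n \mathbb{E}[Z_n; G] = \lim_n \mathbb{E}[X_n; G] = \mathbb{E}[X; G]$$

Hence $Z = \mathbb{E}[X|\mathcal{G}]$ a.s.

The existence of $\mathbb{E}[X|\mathcal{G}]$ for integrable $X$ follows by decomposition into positive and negative parts.

The a.s. uniqueness of $\mathbb{E}[X|\mathcal{G}]$ is straightforward: If $Z, Z'$ are two versions of $\mathbb{E}[X|\mathcal{G}]$, and $\varepsilon > 0$, then $G_\varepsilon := \{Z - Z' > \varepsilon\} \in \mathcal{G}$, and hence $0 \le \varepsilon \cdot \mathbb{P}(G_\varepsilon) \le \mathbb{E}[Z - Z'; G_\varepsilon] = \mathbb{E}[X - X; G_\varepsilon] = 0$. Arguing as above, we see that $\mathbb{P}(Z > Z') = 0$. By symmetry, $\mathbb{P}(Z' > Z) = 0$ as well, i.e. $Z = Z'$ a.s.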






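Remarks 7.1.5 (a) If $\mathcal{G} = \{\emptyset, \Omega\}$, the algebra of zero information, then

$$\mathbb{E}[X|\mathcal{G}] = \mathbb{E}X$$

Recall that only the constant functions are $\mathcal{G}$-measurable, and it is obvious that the best constant approximation to $X$ is $\mathbb{E}X$. However, we must be precise, and show that $\mathbb{E}X$ is a version of the conditional expectation. Since $\mathbb{E}X$ is just a number, we can regard it as a constant function, so that $\mathbb{E}X$ is $\mathcal{G}$-measurable. Moreover

$$\int_\emptyset X\,d\mathbb{P} = \int_\emptyset \mathbb{E}X\,d\mathbb{P} = 0 \qquad \int_\Omega X\,d\mathbb{P} = \int_\Omega \mathbb{E}X\,d\mathbb{P} = \mathbb{E}X$$

and thus $\int_G X\,d\mathbb{P} = \int_G \mathbb{E}X\,d\mathbb{P}$ for all $G \in \mathcal{G}$.

(b) If $X$ is $\mathcal{G}$-measurable, then $\mathbb{E}[X|\mathcal{G}] = X$. Again, this is obvious: If $X$ is $\mathcal{G}$-measurable, then $\mathcal{G}$ contains all the information needed to determine $X$. The best $\mathcal{G}$-measurable "approximation" of $X$ is therefore $X$ itself. Check the definition!

(c) If $B \in \mathcal{G}$ and $\mathbb{P}(B) > 0$, then $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]\,|B] = \mathbb{E}[X|B]$. To see this, note that

$$\mathbb{E}[\mathbb{E}[X|\mathcal{G}]\,|B] = \frac{1}{\mathbb{P}(B)} \int_B \mathbb{E}[X|\mathcal{G}]\,d\mathbb{P} = \frac{1}{\mathbb{P}(B)} \int_B X\,d\mathbb{P} = \mathbb{E}[X|B]$$

Thus the random variables $X$ and $\mathbb{E}[X|\mathcal{G}]$ have the same conditional expectations if we condition over events in $\mathcal{G}$.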



□ 



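Example 7.1.6 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the probability space which models the rolling of a fair die, and let $\mathcal{G}$ be the $\sigma$-algebra which contains the information whether the outcome is odd or even, i.e. $\mathcal{G} = \sigma(\{1,3,5\}, \{2,4,6\})$. Let $X$ be a random variable with $X(\omega) = \omega^2$. We want to determine $\mathbb{E}[X|\mathcal{G}]$. Now $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$-measurable, and therefore constant on the sets $A = \{1,3,5\}$ and $B = \{2,4,6\}$. The integral of $X$ over the set $A$ is

$$\mathbb{E}[X; A] = \tfrac{1}{6}(1^2 + 3^2 + 5^2) = \tfrac{35}{6}$$

Since $\mathbb{E}[X|\mathcal{G}]$ is constant on the set $A$, with value $c_A$ say, we must have

$$c_A \int_A d\mathbb{P} = \int_A \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_A X \, d\mathbb{P} = \tfrac{35}{6}$$

which yields $c_A = \tfrac{35}{3}$, because $\mathbb{P}(A) = \tfrac{1}{2}$. Similarly the value $c_B$ of $\mathbb{E}[X|\mathcal{G}]$ on $B$ must be $c_B = \tfrac{56}{3}$. We thus have

$$\mathbb{E}[X|\mathcal{G}] = \tfrac{35}{3} I_{\{1,3,5\}} + \tfrac{56}{3} I_{\{2,4,6\}}$$

Check that this satisfies all the requirements of a conditional expectation.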

□ 
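The check requested in Example 7.1.6 can be carried out mechanically. The following is a minimal sketch, assuming a finite sample space and a $\sigma$-algebra generated by a partition; the helper name `conditional_expectation` is hypothetical and not part of the notes. It uses exact rational arithmetic, so the defining property $\mathbb{E}[Z; G] = \mathbb{E}[X; G]$ can be tested with equality.

```python
from fractions import Fraction


def conditional_expectation(X, prob, partition):
    """Return E[X | sigma(partition)] as a dict outcome -> value.

    On each block A the value is E[X; A] / P(A), exactly as in the example.
    """
    Z = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        integral = sum(X[w] * prob[w] for w in block)
        for w in block:
            Z[w] = integral / p_block
    return Z


omega = range(1, 7)                              # outcomes of a fair die
prob = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w * w) for w in omega}          # X(w) = w^2
partition = [{1, 3, 5}, {2, 4, 6}]               # odd / even

Z = conditional_expectation(X, prob, partition)

# Defining property: E[Z; G] = E[X; G] for every G in the sigma-algebra.
events = [set(), {1, 3, 5}, {2, 4, 6}, set(omega)]
property_holds = all(
    sum(Z[w] * prob[w] for w in G) == sum(X[w] * prob[w] for w in G)
    for G in events
)
print(Z[1], Z[2], property_holds)
```

The values on the odd and even outcomes are the constants $c_A$ and $c_B$ of the example.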






Example 7.1.7 Take $\Omega = (0,1]$ with the $\sigma$-algebra of Borel sets and with $\mathbb{P}$ the Lebesgue measure on $(0,1]$. Define two random variables $X, Y$ as follows:

$$X(x) = 3x^2 \qquad Y(x) = \begin{cases} 1 & \text{if } 0 < x \leq \tfrac{1}{2} \\ 4x & \text{if } \tfrac{1}{2} < x \leq 1 \end{cases}$$

We want to find a version of $\mathbb{E}[X|Y]$. First we need to find the $\sigma$-algebra generated by $Y$, i.e. we must describe the family of sets $Y^{-1}(B)$, $B \in \mathcal{B}$. Now $Y^{-1}(\{1\}) = (0, \tfrac{1}{2}]$, which shows that this half-open interval is a smallest non-empty set in $\sigma(Y)$. This makes sense: $Y$ cannot distinguish between any of the elements of $(0, \tfrac{1}{2}]$, and therefore neither can $\sigma(Y)$. Moreover, if $2 < a < b < 4$, then $Y^{-1}(a,b) = (\tfrac{a}{4}, \tfrac{b}{4}) \subseteq (\tfrac{1}{2}, 1]$. It is therefore clear that every open set in $(\tfrac{1}{2}, 1]$ belongs to $\sigma(Y)$, and thus that the family $\mathcal{B}(\tfrac{1}{2}, 1]$ of Borel subsets of $(\tfrac{1}{2}, 1]$ is a subset of $\sigma(Y)$. It is now easy to see that $A \in \sigma(Y)$ iff there is $B \in \mathcal{B}(\tfrac{1}{2}, 1]$ such that either $A = B$ or $A = (0, \tfrac{1}{2}] \cup B$. This completely describes the family $\sigma(Y)$. Since $\mathbb{E}[X|Y]$ has to be $\sigma(Y)$-measurable, it must be constant on $(0, \tfrac{1}{2}]$. Also, since $(0, \tfrac{1}{2}] \in \sigma(Y)$, we must have

$$\int_{(0,\frac{1}{2}]} \mathbb{E}(X|Y) \, d\mathbb{P} = \int_{(0,\frac{1}{2}]} X \, d\mathbb{P} = \int_0^{1/2} 3x^2 \, dx = \tfrac{1}{8}$$

so that the random variable $\mathbb{E}[X|Y]$ must take the constant value $\tfrac{1}{4}$ for $x \in (0, \tfrac{1}{2}]$ (because $\mathbb{P}((0, \tfrac{1}{2}]) = \tfrac{1}{2}$). The $\sigma$-algebra $\sigma(Y)$ can distinguish between all the points in the interval $(\tfrac{1}{2}, 1]$: For example, if we know that $Y = 3$, then we know that the outcome is $x = \tfrac{3}{4}$. Thus if we know the value of $Y$ for values $2 < Y \leq 4$, then we know the outcome, and if we know the outcome, we know the value of $X$. We therefore expect the best $\sigma(Y)$-measurable approximation of $X$ over $(\tfrac{1}{2}, 1]$ to be $X$ itself, i.e. we expect $\mathbb{E}[X|Y](x) = 3x^2$ for $\tfrac{1}{2} < x \leq 1$. Putting this together, define a random variable

$$Z(x) = \begin{cases} \tfrac{1}{4} & \text{if } 0 < x \leq \tfrac{1}{2} \\ 3x^2 & \text{if } \tfrac{1}{2} < x \leq 1 \end{cases}$$

We just need to show that $Z$ is a version of $\mathbb{E}[X|Y]$. By similar arguments as for $Y$, it is clear that $\sigma(Z) = \sigma(Y)$, and thus that $Z$ is $\sigma(Y)$-measurable. Now suppose that $A \in \sigma(Y)$. Then either $A = B$ for some $B \in \mathcal{B}(\tfrac{1}{2}, 1]$ or $A = (0, \tfrac{1}{2}] \cup B$. In the former case we obviously have $\mathbb{E}[Z; A] = \mathbb{E}[X; A]$, because $Z = X$ on the interval $(\tfrac{1}{2}, 1]$. In the latter case, we have

$$\mathbb{E}[Z; A] = \mathbb{E}[Z; (0, \tfrac{1}{2}]] + \mathbb{E}[Z; B] = \tfrac{1}{8} + \mathbb{E}[X; B] = \mathbb{E}[X; A]$$



□ 
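The verification in Example 7.1.7 can be replayed with exact arithmetic. The sketch below is only an illustration under stated assumptions: events are restricted to half-open intervals $(a, b]$ with rational endpoints, and the helper names `integral_X` and `integral_Z` are hypothetical. It also shows that the identity $\mathbb{E}[Z; A] = \mathbb{E}[X; A]$ may fail for a set $A$ that is not in $\sigma(Y)$.

```python
from fractions import Fraction

half = Fraction(1, 2)


def integral_X(a, b):
    """Exact integral of X(x) = 3x^2 over (a, b], via the antiderivative x^3."""
    return b**3 - a**3


def integral_Z(a, b):
    """Exact integral of Z over (a, b], where Z = 1/4 on (0, 1/2] and Z = 3x^2 on (1/2, 1]."""
    low_length = min(b, half) - min(a, half)          # length of (a, b] inside (0, 1/2]
    high = integral_X(max(a, half), max(b, half))     # part of (a, b] inside (1/2, 1]
    return Fraction(1, 4) * low_length + high


# Events in sigma(Y): B inside (1/2, 1], and (0, 1/2] united with B.
B = (Fraction(3, 5), Fraction(9, 10))
on_B = integral_Z(*B) == integral_X(*B)
on_atom = integral_Z(Fraction(0), half) == integral_X(Fraction(0), half)

# (0, 1/4] is NOT in sigma(Y), and there the two integrals differ.
quarter = Fraction(1, 4)
outside = (integral_Z(Fraction(0), quarter), integral_X(Fraction(0), quarter))
print(on_B, on_atom, outside)
```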






7.2 Properties of Conditional Expectation 



Theorem 7.2.1 (Properties of Conditional Expectation) 

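The following are true for random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ whenever the expressions occurring inside a conditional expectation are integrable.

(a) $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]] = \mathbb{E}X$;

(b) If $X$ is $\mathcal{G}$-measurable, then $\mathbb{E}[X|\mathcal{G}] = X$ a.s.

(c) LINEARITY: $\mathbb{E}[a_1 X_1 + a_2 X_2|\mathcal{G}] = a_1 \mathbb{E}[X_1|\mathcal{G}] + a_2 \mathbb{E}[X_2|\mathcal{G}]$ a.s.

(d) POSITIVITY: If $X \geq 0$, then $\mathbb{E}[X|\mathcal{G}] \geq 0$ a.s.

(e) cMCT: If $0 \leq X_n \uparrow X$, then $\mathbb{E}[X_n|\mathcal{G}] \uparrow \mathbb{E}[X|\mathcal{G}]$ a.s.

(f) cFATOU: If $X_n \geq 0$, then $\mathbb{E}[\liminf_n X_n|\mathcal{G}] \leq \liminf_n \mathbb{E}[X_n|\mathcal{G}]$ a.s.

(g) cDCT: If $|X_n| \leq Y$ (all $n \in \mathbb{N}$) for some integrable $Y$, and if $X_n \to X$ a.s., then $\mathbb{E}[X_n|\mathcal{G}] \to \mathbb{E}[X|\mathcal{G}]$ a.s.

(h) PROJECTION: $\mathbb{E}[X \cdot \mathbb{E}[Y|\mathcal{G}]] = \mathbb{E}[\mathbb{E}[X|\mathcal{G}] \cdot Y] = \mathbb{E}[\mathbb{E}[X|\mathcal{G}] \cdot \mathbb{E}[Y|\mathcal{G}]]$.

(i) If $Y$ is $\mathcal{G}$-measurable, then $\mathbb{E}[YX|\mathcal{G}] = Y\,\mathbb{E}[X|\mathcal{G}]$ a.s.

(j) TOWER: If $\mathcal{H} \subseteq \mathcal{G}$, then $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[\mathbb{E}[X|\mathcal{H}]|\mathcal{G}] = \mathbb{E}[X|\mathcal{H}]$ a.s.

(k) INDEPENDENCE: If $\mathcal{H}$ is independent of $\sigma(X) \vee \mathcal{G}$, then $\mathbb{E}[X|\mathcal{G} \vee \mathcal{H}] = \mathbb{E}[X|\mathcal{G}]$ a.s.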



Exercise 7.2.2 Prove Thm. 7.2.1 (a), (b), (c), (d).

[Hint: (c) means that if $Y_k$ are versions of $\mathbb{E}[X_k|\mathcal{G}]$ for $k = 1, 2$, then $a_1 Y_1 + a_2 Y_2$ is a version of $\mathbb{E}[a_1 X_1 + a_2 X_2|\mathcal{G}]$.

For (d), let $Z = \mathbb{E}[X|\mathcal{G}]$ a.s., and note that $\{Z < 0\} = \bigcup_{n \in \mathbb{N}} \{Z < -\tfrac{1}{n}\} \in \mathcal{G}$.]



□ 
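Several parts of Thm. 7.2.1 can be sanity-checked on a finite probability space, where conditioning on a partition-generated $\sigma$-algebra is just block averaging. The sketch below is an illustration only, assuming a fair die and two hypothetical partitions `G` and `H` with `G` refining `H` (so that $\mathcal{H} \subseteq \mathcal{G}$); it checks (a), (i) and (j) with exact fractions.

```python
from fractions import Fraction


def cond_exp(V, prob, partition):
    """Block-average V over each block of the partition (conditional expectation)."""
    Z = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        mean = sum(V[w] * prob[w] for w in block) / p
        for w in block:
            Z[w] = mean
    return Z


omega = list(range(1, 7))
prob = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w * w) for w in omega}
G = [{1, 3}, {5}, {2}, {4, 6}]        # finer partition
H = [{1, 3, 5}, {2, 4, 6}]            # coarser partition, sigma(H) inside sigma(G)

EX_G = cond_exp(X, prob, G)
EX_H = cond_exp(X, prob, H)

# (a): E[E[X|G]] = EX
part_a = sum(EX_G[w] * prob[w] for w in omega) == sum(X[w] * prob[w] for w in omega)

# (j) TOWER: both iterated conditionings give E[X|H]
tower_left = cond_exp(EX_G, prob, H)
tower_right = cond_exp(EX_H, prob, G)

# (i): Y is constant on the blocks of G, hence G-measurable
Y = {1: 2, 3: 2, 5: -1, 2: 7, 4: 0, 6: 0}
YX = {w: Y[w] * X[w] for w in omega}
part_i = cond_exp(YX, prob, G) == {w: Y[w] * EX_G[w] for w in omega}

print(part_a, tower_left == EX_H, tower_right == EX_H, part_i)
```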



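Proof of Thm. 7.2.1 (e)-(k):

(e): Suppose that $0 \leq X_n \uparrow X$, and define $Y_n := \mathbb{E}[X_n|\mathcal{G}]$ a.s. By (d), $Y_n$ is increasing a.s. Define $Y = \limsup_n Y_n$, so that $Y$ is $\mathcal{G}$-measurable and $Y_n \uparrow Y$ a.s. If $G \in \mathcal{G}$, then by the MCT, $\mathbb{E}[X; G] = \lim_n \mathbb{E}[X_n; G] = \lim_n \mathbb{E}[Y_n; G] = \mathbb{E}[Y; G]$. Hence $Y = \mathbb{E}[X|\mathcal{G}]$ a.s.

(f): Let $Z_n := \inf_{k \geq n} X_k$, so that $Z_n \uparrow \liminf_n X_n$. Since $Z_n \leq X_k$ whenever $k \geq n$, we have $\mathbb{E}[Z_n|\mathcal{G}] \leq \mathbb{E}[X_k|\mathcal{G}]$ a.s. whenever $k \geq n$, and hence $\mathbb{E}[Z_n|\mathcal{G}] \leq \inf_{k \geq n} \mathbb{E}[X_k|\mathcal{G}]$ a.s. Now by cMCT,

$$\mathbb{E}[\liminf_n X_n|\mathcal{G}] = \lim_n \mathbb{E}[Z_n|\mathcal{G}] \leq \lim_n \inf_{k \geq n} \mathbb{E}[X_k|\mathcal{G}] = \liminf_n \mathbb{E}[X_n|\mathcal{G}]$$

(g): $Y \pm X_n$ are non-negative random variables, so by cFATOU,

$$\mathbb{E}[Y|\mathcal{G}] + \liminf_n (\pm \mathbb{E}[X_n|\mathcal{G}]) = \liminf_n \mathbb{E}[Y \pm X_n|\mathcal{G}] \geq \mathbb{E}[\liminf_n (Y \pm X_n)|\mathcal{G}] = \mathbb{E}[Y|\mathcal{G}] \pm \mathbb{E}[X|\mathcal{G}] \quad \text{a.s.}$$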






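Since $\mathbb{E}[Y|\mathcal{G}]$ is integrable, it is finite a.s., and hence can be cancelled to yield $\liminf_n (\pm \mathbb{E}[X_n|\mathcal{G}]) \geq \pm \mathbb{E}[X|\mathcal{G}]$, which implies

$$\mathbb{E}[X|\mathcal{G}] \leq \liminf_n \mathbb{E}[X_n|\mathcal{G}] \leq \limsup_n \mathbb{E}[X_n|\mathcal{G}] \leq \mathbb{E}[X|\mathcal{G}] \quad \text{a.s.}$$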

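(h): This follows from the usual properties of projections if $X, Y \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$: For suppose that $X = X_\parallel + X_\perp$, $Y = Y_\parallel + Y_\perp$ are decompositions of $X, Y$ into components parallel and perpendicular to $\mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P})$, so that $X_\parallel = \mathbb{E}[X|\mathcal{G}]$, $Y_\parallel = \mathbb{E}[Y|\mathcal{G}]$ (by the second proof of Thm. 7.1.3). Then

$$\mathbb{E}[X \cdot \mathbb{E}[Y|\mathcal{G}]] = \langle X, Y_\parallel \rangle = \langle X_\parallel, Y_\parallel \rangle = \mathbb{E}[\mathbb{E}[X|\mathcal{G}] \cdot \mathbb{E}[Y|\mathcal{G}]]$$

because $\langle X_\perp, Y_\parallel \rangle = 0$. If $X, Y \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ are non-negative, we may define $X_n := X \wedge n$, $Y_n := Y \wedge n$. Then $0 \leq X_n \uparrow X$ and $0 \leq Y_n \uparrow Y$, and $X_n, Y_n \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$. It follows by the MCT and cMCT that

$$\mathbb{E}[X \cdot \mathbb{E}[Y|\mathcal{G}]] = \lim_n \mathbb{E}[X_n \cdot \mathbb{E}[Y_n|\mathcal{G}]] = \lim_n \mathbb{E}[\mathbb{E}[X_n|\mathcal{G}] \cdot \mathbb{E}[Y_n|\mathcal{G}]] = \mathbb{E}[\mathbb{E}[X|\mathcal{G}] \cdot \mathbb{E}[Y|\mathcal{G}]]$$

(i): If $Y = I_G$ is an indicator function with $G \in \mathcal{G}$, and $G' \in \mathcal{G}$, then

$$\mathbb{E}[Y\,\mathbb{E}[X|\mathcal{G}]; G'] = \mathbb{E}[\mathbb{E}[X|\mathcal{G}]; G \cap G'] = \mathbb{E}[X; G \cap G'] = \mathbb{E}[YX; G']$$

Hence $Y\,\mathbb{E}[X|\mathcal{G}]$ is a version of $\mathbb{E}[YX|\mathcal{G}]$. The result now follows by linearity and cMCT.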

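(j): Consider the case where $X \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$. Since $\mathcal{L}^2(\Omega, \mathcal{H}, \mathbb{P}) \subseteq \mathcal{L}^2(\Omega, \mathcal{G}, \mathbb{P}) \subseteq \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ are closed Hilbert subspaces, the result follows from the fact that a projection of a projection is a projection. Alternatively, let

$$Y := \mathbb{E}[X|\mathcal{G}] \text{ a.s.} \qquad Z := \mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[Y|\mathcal{H}] \text{ a.s.}$$

If $H \in \mathcal{H} \subseteq \mathcal{G}$, then

$$\mathbb{E}[Z; H] = \mathbb{E}[Y; H] = \mathbb{E}[X; H]$$

and hence $Z$ is a version of $\mathbb{E}[X|\mathcal{H}]$, i.e. $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ a.s.
The fact that $\mathbb{E}[\mathbb{E}[X|\mathcal{H}]|\mathcal{G}] = \mathbb{E}[X|\mathcal{H}]$ a.s. follows directly from (i), since $\mathbb{E}[X|\mathcal{H}]$ is $\mathcal{G}$-measurable.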

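(k): Let $Y := \mathbb{E}[X|\mathcal{G}]$. Since $Y$ is certainly $\mathcal{G} \vee \mathcal{H}$-measurable, we must show that $\mathbb{E}[Y; F] = \mathbb{E}[X; F]$ for all $F \in \mathcal{G} \vee \mathcal{H}$. Now let $\mathcal{C} := \{G \cap H : G \in \mathcal{G}, H \in \mathcal{H}\}$, and let $\mathcal{D} := \{F \in \mathcal{G} \vee \mathcal{H} : \mathbb{E}[Y; F] = \mathbb{E}[X; F]\}$. First note that $\mathcal{C} \subseteq \mathcal{D}$: For if $G \in \mathcal{G}$, $H \in \mathcal{H}$, then $\mathbb{E}[X; G \cap H] = \mathbb{E}[X I_G]\,\mathbb{E}[I_H]$, by independence, and so

$$\mathbb{E}[X; G \cap H] = \mathbb{E}[X; G]\,\mathbb{E}[I_H] = \mathbb{E}[Y; G]\,\mathbb{E}[I_H] = \mathbb{E}[Y; G \cap H]$$

since $Y I_G$ is independent of $\mathcal{H}$. It is straightforward to verify that $\mathcal{C}$ is a $\pi$-system that generates $\mathcal{G} \vee \mathcal{H}$, and that $\mathcal{D}$ is a $\lambda$-system. Hence by Dynkin's Lemma (Thm. 1.5.4), $\mathcal{D} = \mathcal{G} \vee \mathcal{H}$.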



Proposition 7.2.3 (Jensen's inequality) 

Suppose that $g : U \to \mathbb{R}$ is a convex function on an open interval $U \subseteq \mathbb{R}$, and that $X$ is a random variable with values in $U$ (a.s.) such that both $X$ and $g(X)$ have finite expected values. Then

$$\mathbb{E}[g(X)|\mathcal{G}] \geq g(\mathbb{E}[X|\mathcal{G}])$$






Proof: We use notation and results from Remarks 4.7.4. Let $v \in U$, and let $D^-(v) = \lim_{u \uparrow v} \Delta(u, v)$ and $D^+(v) = \lim_{w \downarrow v} \Delta(v, w)$. Then $D^-(v), D^+(v)$ both exist, and $D^-(v) \leq D^+(v)$.

Now suppose that $m$ is a real number satisfying $D^-(v) \leq m \leq D^+(v)$, and that $x \in U$. We consider two cases: If (i) $x < v$, then $\Delta(x, v) \leq D^-(v)$ (since $\Delta(u, v)$ increases as $u \uparrow v$) and thus $\Delta(x, v) \leq m$. It follows that $g(x) \geq m(x - v) + g(v)$. Next, if (ii) $x > v$, then $\Delta(v, x) \geq D^+(v)$ (because $\Delta(v, w)$ decreases as $w \downarrow v$) and thus $\Delta(v, x) \geq m$. It follows that $g(x) \geq m(x - v) + g(v)$. Hence, in either case, we have

$$g(x) \geq m(x - v) + g(v)$$

for any $v \in U$, any $x \in U$, and any $D^-(v) \leq m \leq D^+(v)$.

We are now ready to prove Jensen's inequality: Put $v = \mathbb{E}[X|\mathcal{G}]$. Then

$$g(X) \geq m(X - \mathbb{E}[X|\mathcal{G}]) + g(\mathbb{E}[X|\mathcal{G}]) \text{ a.s.} \quad \text{whenever } D^-(\mathbb{E}[X|\mathcal{G}]) \leq m \leq D^+(\mathbb{E}[X|\mathcal{G}])$$

If we now take conditional expectations on both sides, then

$$\mathbb{E}[g(X)|\mathcal{G}] \geq m(\mathbb{E}[X|\mathcal{G}] - \mathbb{E}[X|\mathcal{G}]) + \mathbb{E}[g(\mathbb{E}[X|\mathcal{G}])|\mathcal{G}] = g(\mathbb{E}[X|\mathcal{G}])$$

H 

Some notation: we define

$$\mathbb{E}[X|\mathcal{G}|\mathcal{H}] := \mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}]$$

We end this section with some simple examples: 

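Examples 7.2.4 (a) Suppose that $X$ is a random variable in $\mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$, where $p \geq 1$, and let $Y$ be a version of $\mathbb{E}(X|\mathcal{G})$. Then $Y$ is in $\mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$ as well. This is because $g(x) = |x|^p$ is convex. By Jensen's inequality, we therefore have $|\mathbb{E}(X|\mathcal{G})|^p \leq \mathbb{E}(|X|^p|\mathcal{G})$ a.s., i.e. $|Y|^p \leq \mathbb{E}(|X|^p|\mathcal{G})$. It follows that $\mathbb{E}|Y|^p \leq \mathbb{E}[\mathbb{E}(|X|^p|\mathcal{G})] = \mathbb{E}|X|^p < +\infty$, by Thm. 7.2.1(a).

(b) Suppose that $X \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ (i.e. that $\mathrm{var}(X)$ exists). If $Y$ is a version of $\mathbb{E}(X|\mathcal{G})$, then $Y \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathbb{P})$ as well, by (a), and $\mathbb{E}Y^2 \leq \mathbb{E}X^2$. Since $\mathbb{E}Y = \mathbb{E}X$ (by Thm. 7.2.1(a)), we thus have

$$\mathrm{var}(Y) = \mathbb{E}Y^2 - (\mathbb{E}Y)^2 \leq \mathbb{E}X^2 - (\mathbb{E}X)^2 = \mathrm{var}(X)$$

Thus $\mathrm{var}(Y) \leq \mathrm{var}(X)$. This reflects the fact that $Y$, being cruder, can't vary as much as $X$ can.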

□ 
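Both the conditional Jensen inequality with $g(x) = x^2$ and the variance reduction of Examples 7.2.4(b) can be illustrated on a fair die. This is a small sketch under assumptions of our own choosing: the partition $\{1,2\}, \{3,4,5,6\}$ and the helper names are hypothetical.

```python
from fractions import Fraction

omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}
partition = [{1, 2}, {3, 4, 5, 6}]


def cond_exp(V, prob, partition):
    """Conditional expectation given the sigma-algebra generated by a partition."""
    Z = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        mean = sum(V[w] * prob[w] for w in block) / p
        for w in block:
            Z[w] = mean
    return Z


def mean(V):
    return sum(V[w] * prob[w] for w in prob)


def var(V):
    return mean({w: V[w] ** 2 for w in V}) - mean(V) ** 2


Y = cond_exp(X, prob, partition)                    # a version of E[X|G]
X_squared = {w: X[w] ** 2 for w in omega}
E_X2_given_G = cond_exp(X_squared, prob, partition)

# Conditional Jensen for g(x) = x^2: E[X^2|G] >= (E[X|G])^2 pointwise.
jensen_holds = all(E_X2_given_G[w] >= Y[w] ** 2 for w in omega)

var_X, var_Y = var(X), var(Y)
print(jensen_holds, var_X, var_Y)
```

Here $Y$ takes only two values, so it is "cruder" than $X$, and its variance is correspondingly smaller.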

7.3 Radon-Nikodym Theorem

We begin with some definitions: 



Definition 7.3.1 Let $\nu, \mu$ be measures on a measurable space $(\Omega, \mathcal{F})$.

(i) We say that $\nu$ is absolutely continuous w.r.t. $\mu$, and write $\nu \ll \mu$, iff $\mu A = 0$ implies $\nu A = 0$ for all $A \in \mathcal{F}$.

(ii) $\mu, \nu$ are equivalent iff $\nu \ll \mu$ and $\mu \ll \nu$.

(iii) We say that $\mu, \nu$ are mutually singular, and write $\mu \perp \nu$, iff there exists $A \in \mathcal{F}$ such that $\mu A = \nu A^c = 0$.






Remarks 7.3.2 (a) Two measures are equivalent iff they have the same null sets. 

(b) Two probability measures are equivalent iff they have the same null sets, iff they have 
the same sets of measure 1, iff they have the same sets of positive measure. 



□ 



Suppose that $(\Omega, \mathcal{F}, \mu)$ is a measure space, and that $f \in m\mathcal{F}^+$. Recall from Propn. 4.5.1 that there is a measure $\nu = f \cdot \mu$ on $(\Omega, \mathcal{F})$, defined by

$$\nu A = \int_A f \, d\mu$$

It is clear that $\nu \ll \mu$.

The map $f$ is called the density, or Radon-Nikodym derivative, of $\nu$ w.r.t. $\mu$, and also denoted $\frac{d\nu}{d\mu}$.

Furthermore, we showed that

$$\int g \, d\nu = \int g \, \frac{d\nu}{d\mu} \, d\mu$$

for any $\nu$-integrable function $g$ (cf. Propn. 4.5.2, the Chain Rule).

The Radon-Nikodym Theorem (below) states that this way of constructing an absolutely continuous measure is the only way to do so: If $\nu \ll \mu$, then $\nu$ has a density, i.e. then $\nu = f \cdot \mu$ for some non-negative measurable $f$. Also recall that if $\nu = f \cdot \mu$, then $\nu g = \mu(fg)$ (cf. Propn. 4.5.2, the Chain Rule).



We leave the following proposition as an exercise: 



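Proposition 7.3.3 (a) Let $\mu, \nu, \eta$ be $\sigma$-finite measures on a measurable space $(\Omega, \mathcal{F})$, and suppose that $\frac{d\eta}{d\nu}$ and $\frac{d\nu}{d\mu}$ exist. Then $\frac{d\eta}{d\mu}$ exists, and

$$\frac{d\eta}{d\mu} = \frac{d\eta}{d\nu} \, \frac{d\nu}{d\mu}$$

(b) If $\frac{d\nu}{d\mu}$ exists and $\frac{d\nu}{d\mu} > 0$ $\mu$-a.e., then $\frac{d\mu}{d\nu}$ exists and

$$\frac{d\mu}{d\nu} = \frac{1}{d\nu/d\mu} \qquad \mu\text{-a.e.}$$

where $\frac{d\mu}{d\nu}$ may be defined to be an arbitrary constant (e.g. 0) where $\frac{d\nu}{d\mu} = 0$.
In particular, $\mu, \nu$ are equivalent measures.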



Here is the main result of this section: 



Theorem 7.3.4 (Radon-Nikodym) 

Suppose that $\nu, \mu$ are $\sigma$-finite measures on $(\Omega, \mathcal{F})$ and that $\nu \ll \mu$. Then $\frac{d\nu}{d\mu}$ exists, i.e. there exists a $\mu$-a.e. unique $f \in m\mathcal{F}^+$ such that $\nu = f \cdot \mu$.






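Proof: We prove the result for the case where $\mu, \nu$ are finite measures on $(\Omega, \mathcal{F})$, and that $\nu \ll \mu$. If $\nu(\Omega) = 0$, then $\nu(A) = \int_A 0 \, d\mu$ for all $A \in \mathcal{F}$, i.e. $\frac{d\nu}{d\mu} = 0$. We may therefore assume that $\nu(\Omega) > 0$, from which it follows that also $\mu(\Omega) > 0$.
Define the probability space $(\bar\Omega, \bar{\mathcal{F}}, \mathbb{P})$ by

$$\bar\Omega := \Omega \times \{0,1\} \qquad \bar{\mathcal{F}} := \{(A \times \{0\}) \cup (B \times \{1\}) : A, B \in \mathcal{F}\}$$

and

$$\mathbb{P}((A \times \{0\}) \cup (B \times \{1\})) := \frac{\mu(A)}{2\mu(\Omega)} + \frac{\nu(B)}{2\nu(\Omega)}$$

(It is straightforward to verify that $(\bar\Omega, \bar{\mathcal{F}}, \mathbb{P})$ is a probability space.)
Now define a random variable $X$ and a $\sigma$-subalgebra $\mathcal{G} \subseteq \bar{\mathcal{F}}$ by

$$X := I_{\Omega \times \{1\}} \qquad \mathcal{G} := \{A \times \{0,1\} : A \in \mathcal{F}\}$$

Let $Z := \mathbb{E}[X|\mathcal{G}]$, and note that

$$Z(\omega, 0) = Z(\omega, 1) \qquad \mathbb{P}\text{-a.s.}$$

Therefore we may define a function $g : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by

$$g(\omega) := Z(\omega, j) \qquad j = 0, 1$$

Note that $g = Z \circ \iota_j$, where $\iota_j : (\Omega, \mathcal{F}) \to (\bar\Omega, \mathcal{G}) : \omega \mapsto (\omega, j)$ is measurable. Hence $g$ is $\mathcal{F}$-measurable. Furthermore $0 \leq g \leq 1$, because $X$ is an indicator function.

By definition of conditional expectation, we have $\mathbb{E}[X; A \times \{0,1\}] = \mathbb{E}[Z; A \times \{0,1\}]$ for any $A \in \mathcal{F}$. Now clearly $\mathbb{E}[X; A \times \{0,1\}] = \mathbb{P}(A \times \{1\}) = \frac{\nu(A)}{2\nu(\Omega)}$. On the other hand

$$\mathbb{E}[Z; A \times \{0,1\}] = \int_A g \, d\left(\tfrac{1}{2}\left[\tfrac{\mu}{\mu(\Omega)} + \tfrac{\nu}{\nu(\Omega)}\right]\right)$$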


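We may conclude that

$$\int_A g \, d\left(\tfrac{1}{2}\left[\tfrac{\mu}{\mu(\Omega)} + \tfrac{\nu}{\nu(\Omega)}\right]\right) = \frac{\nu(A)}{2\nu(\Omega)}$$

and thus that

$$\frac{1}{2\mu(\Omega)} \int_A g \, d\mu + \frac{1}{2\nu(\Omega)} \int_A g \, d\nu = \frac{1}{2\nu(\Omega)} \int_A 1 \, d\nu$$

i.e. that

$$\frac{1}{2\mu(\Omega)} \int_A g \, d\mu = \frac{1}{2\nu(\Omega)} \int_A (1 - g) \, d\nu \qquad (*)$$

Define therefore a new measure $\eta$ on $(\Omega, \mathcal{F})$ by

$$\eta(A) := \frac{1}{2\mu(\Omega)} \int_A g \, d\mu = \frac{1}{2\nu(\Omega)} \int_A (1 - g) \, d\nu$$

Then

$$\frac{d\eta}{d\mu} = \frac{g}{2\mu(\Omega)} \qquad \frac{d\eta}{d\nu} = \frac{1 - g}{2\nu(\Omega)}$$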



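and thus $\eta \ll \mu$, $\eta \ll \nu$.

With $A := \{\omega \in \Omega : g(\omega) = 1\}$ in $(*)$, we see that $\int_A (1 - g) \, d\nu = 0$, and hence that $\frac{1}{2\mu(\Omega)} \int_A g \, d\mu = 0$ also. It follows that $\mu(\{\omega \in \Omega : g(\omega) = 1\}) = 0$, and since $\nu \ll \mu$ that $\nu(\{\omega \in \Omega : g(\omega) = 1\}) = 0$. In particular $\nu(\{\omega \in \Omega : 1 - g(\omega) = 0\}) = 0$, and hence $\frac{d\eta}{d\nu} > 0$ $\nu$-a.e. It follows that $\frac{d\nu}{d\eta}$ exists, and that $\frac{d\nu}{d\eta} = \frac{1}{d\eta/d\nu}$, cf. Propn. 7.3.3.

Now define $f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by

$$f(\omega) := \begin{cases} \dfrac{\nu(\Omega)}{\mu(\Omega)} \, \dfrac{g(\omega)}{1 - g(\omega)} & \text{if } g(\omega) < 1 \\ 0 & \text{else} \end{cases}$$

Then

$$\int_A f \, d\mu = \int_A \frac{g / 2\mu(\Omega)}{(1 - g) / 2\nu(\Omega)} \, d\mu = \int_A \frac{d\nu}{d\eta} \, \frac{d\eta}{d\mu} \, d\mu = \int_A \frac{d\nu}{d\eta} \, d\eta = \nu(A)$$

and thus $f = \frac{d\nu}{d\mu}$.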



Exercise 7.3.5 Extend the above proof to the case where $\mu, \nu$ are $\sigma$-finite.



□ 
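The doubling construction in the proof of the Radon-Nikodym theorem can be traced by hand on a finite space, where every measure is determined by its point masses. The sketch below is only an illustration: the three-point space and the weights of `mu` and `nu` are made-up assumptions, chosen so that $\nu \ll \mu$ and every point has positive $\mu$-mass. It computes $g$ as the conditional expectation of $X = I_{\Omega \times \{1\}}$ given $\mathcal{G}$ and then the density $f = \frac{\nu(\Omega)}{\mu(\Omega)} \frac{g}{1-g}$.

```python
from fractions import Fraction

# Hypothetical point masses on a three-point space; nu << mu holds trivially
# because mu charges every point.
mu = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(3)}
nu = {"a": Fraction(4), "b": Fraction(0), "c": Fraction(1)}
mu_total = sum(mu.values())
nu_total = sum(nu.values())

# Probability measure on the doubled space Omega x {0, 1}.
P = {}
for w in mu:
    P[(w, 0)] = mu[w] / (2 * mu_total)
    P[(w, 1)] = nu[w] / (2 * nu_total)

# G is generated by the blocks {(w, 0), (w, 1)}; X is the indicator of Omega x {1}.
# So g(w) = E[X | G](w, j) = P({(w, 1)}) / P({(w, 0), (w, 1)}).
g = {w: P[(w, 1)] / (P[(w, 0)] + P[(w, 1)]) for w in mu}

# Density from the proof: f = (nu(Omega)/mu(Omega)) * g / (1 - g) where g < 1, else 0.
f = {
    w: (nu_total / mu_total) * g[w] / (1 - g[w]) if g[w] < 1 else Fraction(0)
    for w in mu
}

# nu(A) = integral of f over A w.r.t. mu, checked on every atom.
is_density = all(f[w] * mu[w] == nu[w] for w in mu)
print(g, f, is_density)
```

On atoms with positive $\mu$-mass the result is simply the ratio of point masses, which is what the density of $\nu$ w.r.t. $\mu$ must be in the discrete case.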



Chapter 8 

Convergence in Probability Theory 



8.1 Convergence of Random Variables and Measures 

8.1.1 Modes of Convergence 

We list here various forms of convergence: 
Definition 8.1.1 (Modes of Convergence) 

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $X_1, X_2, \dots$ and $X$ be random variables.

(a) $X_n$ converges to $X$ almost surely, written $X_n \to X$ a.s., provided that the set $\{\omega \in \Omega : X_n(\omega) \not\to X(\omega)\}$ has $\mathbb{P}$-measure 0, i.e. provided that

$$\mathbb{P}(X_n \to X) = 1$$

(b) $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{\mathbb{P}} X$, provided that

$$\lim_n \mathbb{P}(|X_n - X| > \varepsilon) = 0 \quad \text{for all } \varepsilon > 0$$

(c) $X_n$ converges to $X$ in $\mathcal{L}^p$, or in $p$-th mean, written $X_n \xrightarrow{\mathcal{L}^p} X$, iff each $X_n \in \mathcal{L}^p(\Omega, \mathcal{F}, \mathbb{P})$, and $\|X_n - X\|_p \to 0$, i.e. iff $\mathbb{E}|X_n|^p < +\infty$ for all $n$ and

$$\mathbb{E}|X_n - X|^p \to 0 \quad \text{as } n \to +\infty$$

(d) We say that $X_n$ converges weakly to $X$, written $X_n \Rightarrow X$, provided that

$$\lim_n \mathbb{E}f(X_n) = \mathbb{E}f(X) \quad \text{for every bounded continuous } f : \mathbb{R} \to \mathbb{R}$$

This is also called convergence in distribution or convergence in law.

We shall look at weak convergence in greater detail later, and shall provide alternative for- 
mulations. Note that the notion of weak convergence of random variables makes sense even 
if these variables are defined on different probability spaces. 

Remarks 8.1.2 (a) Many of the above notions can be extended to random vectors. For example, if $X_n, X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, we say that $X_n \xrightarrow{\mathbb{P}} X$ iff $\lim_n \mathbb{P}(\|X_n - X\| > \varepsilon) = 0$ for all $\varepsilon > 0$. Proofs of the results below can easily be extended to the more general case.

(b) Just a note of caution: The limits in the above notions of convergence are not unique, but unique only up to a.s.-equality. Thus, for example, if $X_n \Rightarrow X$ and $X = Y$ a.s., then also $X_n \Rightarrow Y$. The same goes for a.s. convergence, convergence in $\mathcal{L}^p$ and convergence in probability.

(c) A sequence of random variables $(X_n)_n$ is said to converge to $X$ in distribution iff

$$F_{X_n}(x) \to F_X(x) \quad \text{at every } x \text{ where } F_X \text{ is continuous}$$

where $F_{X_n}, F_X$ are the distribution functions of $X_n, X$. We shall see that this notion coincides with weak convergence.

□ 

Note that convergence in $p$-th mean is just ordinary convergence in the metric space $\mathcal{L}^p$ (with metric $d(f,g) = \|f - g\|_p$). Convergence in probability is also determined by a metric. First note the following:

Lemma 8.1.3 If a function $f : \mathbb{R}^+ \to \mathbb{R}^+$ satisfies

(i) $f$ is increasing;

(ii) $f$ is concave;

(iii) $f(0) = 0$;

then $d(x,y) := f(|x - y|)$ defines a metric on $\mathbb{R}$.

Proof: It is clear that $d(x,y) \geq 0$, that $d(x,y) = 0$ iff $x = y$, and that $d(x,y) = d(y,x)$. The triangle inequality follows directly from the fact that $f(x + y) \leq f(x) + f(y)$. To see this, assume w.l.o.g. that $x \leq y$. Then $y = \frac{x}{x+y} \cdot 0 + \frac{y}{x+y}(x + y)$, and so, by concavity and $f(0) = 0$, $f(y) \geq \frac{y}{x+y} f(x + y)$. Similarly $f(x) \geq \frac{x}{x+y} f(x + y)$, and the result follows by addition. Now $d(x,y) = f(|x - y|) \leq f(|x - z| + |z - y|) \leq f(|x - z|) + f(|z - y|) = d(x,z) + d(z,y)$.

H 
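The key step in the proof of Lemma 8.1.3 is the subadditivity $f(x + y) \leq f(x) + f(y)$. The following sketch checks it numerically on a grid for three concave candidates, $x \wedge 1$, $x/(1+x)$ and $\arctan x$, and contrasts them with the convex function $x^2$, for which it fails. The grid, the tolerance and the helper names are arbitrary choices made for this illustration, and a grid check is of course no substitute for the proof.

```python
import math

candidates = {
    "min(x, 1)": lambda x: min(x, 1.0),
    "x / (1 + x)": lambda x: x / (1.0 + x),
    "arctan(x)": math.atan,
}

grid = [i / 10 for i in range(0, 51)]   # 0.0, 0.1, ..., 5.0


def is_subadditive(f, points, tol=1e-12):
    """Check f(x + y) <= f(x) + f(y) for all pairs of grid points."""
    return all(f(x + y) <= f(x) + f(y) + tol for x in points for y in points)


def triangle_inequality_holds(f, points, tol=1e-12):
    """Check d(x, y) <= d(x, z) + d(z, y) for d(x, y) = f(|x - y|)."""
    return all(
        f(abs(x - y)) <= f(abs(x - z)) + f(abs(z - y)) + tol
        for x in points[::5] for y in points[::5] for z in points[::5]
    )


results = {name: is_subadditive(f, grid) for name, f in candidates.items()}
triangles = {name: triangle_inequality_holds(f, grid) for name, f in candidates.items()}
square_is_subadditive = is_subadditive(lambda x: x * x, grid)   # convex, not concave

print(results, triangles, square_is_subadditive)
```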

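Lemma 8.1.4 Suppose that $f : \mathbb{R}^+ \to \mathbb{R}^+$ is a function which is increasing, and strictly increasing on some interval $(0, a)$, is bounded, continuous, and satisfies $f(0) = 0$. Then

$$X_n \xrightarrow{\mathbb{P}} X \quad \text{iff} \quad \mathbb{E}f(|X_n - X|) \to 0$$

Proof: Nothing is sacrificed if we restrict ourselves to the case $X = 0$.

($\Rightarrow$): Let $K$ be a bound for $f$, and let $\varepsilon > 0$. Choose $\delta$ such that $0 \leq f(x) < \varepsilon$ whenever $0 \leq x \leq \delta$. Then

$$\mathbb{E}f(|X_n|) = \mathbb{E}[f(|X_n|); |X_n| > \delta] + \mathbb{E}[f(|X_n|); |X_n| \leq \delta] \leq K\,\mathbb{P}(|X_n| > \delta) + \varepsilon\,\mathbb{P}(|X_n| \leq \delta)$$

and so $\limsup_n \mathbb{E}f(|X_n|) \leq \varepsilon$. Since $\varepsilon$ was arbitrary, $\lim_n \mathbb{E}f(|X_n|) = 0$.

($\Leftarrow$): Let $\varepsilon > 0$. Then

$$f(\varepsilon) I_{\{|X_n| > \varepsilon\}} \leq f(|X_n|) I_{\{|X_n| > \varepsilon\}} \leq f(|X_n|)$$

and so

$$0 \leq f(\varepsilon) \limsup_n \mathbb{P}(|X_n| > \varepsilon) \leq \lim_n \mathbb{E}f(|X_n|) = 0$$

Since $f(\varepsilon) > 0$, it follows that $\lim_n \mathbb{P}(|X_n| > \varepsilon) = 0$.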



Using the function $f(x) = x^p \wedge 1$, we see that:

Proposition 8.1.5 If $p \geq 1$, then $X_n \xrightarrow{\mathbb{P}} X$ iff $\mathbb{E}[|X_n - X|^p \wedge 1] \to 0$.

In particular, with $p = 1$, we see that

$$X_n \xrightarrow{\mathbb{P}} X \quad \text{iff} \quad \mathbb{E}[|X_n - X| \wedge 1] \to 0$$

As we shall see now, the map $d(X, Y) := \mathbb{E}[|X - Y| \wedge 1]$ defines a metric: Let $\mathcal{L}^0(\Omega, \mathcal{F}, \mathbb{P})$ denote the family of all random variables, and let $L^0(\Omega, \mathcal{F}, \mathbb{P})$ denote the set of all equivalence classes of random variables, where $X, Y$ are equivalent iff $X = Y$ a.s.



Proposition 8.1.6 There is a metric $d$ on $L^0(\Omega, \mathcal{F}, \mathbb{P})$ such that $d(X_n, X) \to 0$ iff $X_n \xrightarrow{\mathbb{P}} X$, and such that the metric space $(L^0(\Omega, \mathcal{F}, \mathbb{P}), d)$ is complete.

Proof: By Lemmas 8.1.3 and 8.1.4 it suffices to find a bounded, strictly increasing, concave function $f : \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, for then $d(X, Y) = \mathbb{E}f(|X - Y|)$ defines such a metric. One example is given by $f(x) = x \wedge 1$, i.e.

$$d(X, Y) = \mathbb{E}[|X - Y| \wedge 1]$$

(Another is given by $f(x) = \frac{x}{1 + x}$, i.e.

$$d(X, Y) = \mathbb{E}\left[\frac{|X - Y|}{1 + |X - Y|}\right]$$

Yet another is given by $f(x) = \arctan x$.)

Suppose that $(X_n)_n$ is a Cauchy sequence in $L^0$, i.e. that $\sup_{m \geq n} d(X_m, X_n) \to 0$. Choose an increasing sequence $(n_k)_k$ of natural numbers such that $\sup_{m \geq n_k} d(X_m, X_{n_k}) < 2^{-k}$. Then $\mathbb{E}\sum_k (|X_{n_k} - X_{n_{k+1}}| \wedge 1) < \infty$, and thus $\sum_k |X_{n_k} - X_{n_{k+1}}| < \infty$ a.s., from which it follows that $(X_{n_k})_k$ is a Cauchy sequence a.s., and thus converges a.s. to some measurable limit $X$.

To see that $X_n \xrightarrow{\mathbb{P}} X$, note that $d(X_n, X) \leq d(X_n, X_{n_k}) + d(X_{n_k}, X)$, which goes to zero as $n, k \to \infty$, by DCT and the fact that $(X_n)_n$ is a Cauchy sequence.

H 

We now investigate how the various modes of convergence are related. 

Proposition 8.1.7 $X_n \xrightarrow{\mathbb{P}} X$ iff every subsequence $(X_{n_k})_k$ has a subsubsequence $(X_{n_{k_j}})_j$ such that $X_{n_{k_j}} \to X$ a.s.

In particular, if $X_n \to X$ a.s., then $X_n \xrightarrow{\mathbb{P}} X$.

Proof: ($\Rightarrow$): Fix a subsequence $(X_{n_k})_k$ of $(X_n)_n$. Choose a subsubsequence $(X_{n_{k_j}})_j$ such that

$$\mathbb{P}(|X_{n_{k_j}} - X| > \tfrac{1}{j}) < 2^{-j}$$

Then $\sum_j \mathbb{P}(|X_{n_{k_j}} - X| > \tfrac{1}{j}) < \infty$, so by BC1, $\mathbb{P}(|X_{n_{k_j}} - X| > \tfrac{1}{j}, \text{ i.o.}) = 0$. It follows that $X_{n_{k_j}} \to X$ a.s. as $j \to \infty$.

($\Leftarrow$): We use Propn. 8.1.5 with $p = 1$. Suppose $\limsup_n \mathbb{E}[|X_n - X| \wedge 1] > \varepsilon$ for some $\varepsilon > 0$. Choose a subsequence $(X_{n_k})_k$ such that $\mathbb{E}[|X_{n_k} - X| \wedge 1] > \varepsilon$ for all $k$. If $(X_{n_{k_j}})_j$ is a subsubsequence, we cannot have $X_{n_{k_j}} \to X$ a.s., for then $|X_{n_{k_j}} - X| \wedge 1 \to 0$ a.s., and hence $\mathbb{E}[|X_{n_{k_j}} - X| \wedge 1] \to 0$ by DCT.


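Corollary 8.1.8 If $X_n \xrightarrow{\mathbb{P}} X$ and $f : \mathbb{R} \to \mathbb{R}$ is a.s. continuous at $X$, then also $f(X_n) \xrightarrow{\mathbb{P}} f(X)$.

Proof: Given a subsequence $(X_{n_k})_k$, choose a subsubsequence such that $X_{n_{k_j}} \to X$ a.s. Then $f(X_{n_{k_j}}) \to f(X)$ a.s., because $f$ is a.s. continuous at $X$. By Propn. 8.1.7, $f(X_n) \xrightarrow{\mathbb{P}} f(X)$.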



Proposition 8.1.9 If $X_n \xrightarrow{\mathbb{P}} X$, then $X_n \Rightarrow X$. The converse holds if $X$ is a.s. constant.

Proof: First assume that $X_n \xrightarrow{\mathbb{P}} X$. If $f$ is a bounded continuous function and $\mathbb{E}f(X_n) \not\to \mathbb{E}f(X)$, we may choose a subsequence $(X_{n_k})_k$ such that $\liminf_k |\mathbb{E}[f(X_{n_k}) - f(X)]| > 0$. Choose a subsubsequence such that $X_{n_{k_j}} \to X$ a.s., and thus also $f(X_{n_{k_j}}) \to f(X)$ a.s. By DCT $\mathbb{E}f(X_{n_{k_j}}) \to \mathbb{E}f(X)$ as $j \to \infty$, contradiction.

Conversely, if $X_n \Rightarrow c$, where $c \in \mathbb{R}$, then since $f(x) = |x - c| \wedge 1$ is a bounded continuous function, we have $\mathbb{E}[|X_n - c| \wedge 1] \to \mathbb{E}[|c - c| \wedge 1] = 0$. Thus $X_n \xrightarrow{\mathbb{P}} c$, by Propn. 8.1.5.



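Proposition 8.1.10 If $X_n \xrightarrow{\mathcal{L}^p} X$ for $p \geq 1$, then $X_n \xrightarrow{\mathbb{P}} X$.

Proof: Since $|X_n - X|^p \wedge 1 \leq |X_n - X|^p$, we see that $\mathbb{E}[|X_n - X|^p \wedge 1] \to 0$ whenever $\mathbb{E}|X_n - X|^p \to 0$. By Propn. 8.1.5 we see that $X_n \xrightarrow{\mathbb{P}} X$ whenever $X_n \xrightarrow{\mathcal{L}^p} X$.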



Remarks 8.1.11 So far we have seen that a.s.-convergence and $\mathcal{L}^p$-convergence both imply convergence in probability, and that convergence in probability implies weak convergence. Without further assumptions, this is the best we can do:

(a) Convergence in $\mathcal{L}^p$ does not imply almost sure convergence: On the unit interval with Lebesgue measure, consider the following sequence of intervals:

$$[0,1],\ [0,\tfrac12],\ [\tfrac12,1],\ [0,\tfrac13],\ [\tfrac13,\tfrac23],\ [\tfrac23,1],\ [0,\tfrac14],\ [\tfrac14,\tfrac12],\ [\tfrac12,\tfrac34],\ [\tfrac34,1],\ [0,\tfrac15],\ \dots$$

Let $X_n$ be the indicator function of the $n$-th interval on this list. It is clear that for sufficiently large $n$, $\int X_n \, d\lambda$ is arbitrarily small, and thus that $X_n$ converges to 0 in $\mathcal{L}^p$ for all $p \geq 1$. However, $X_n$ does not converge pointwise anywhere: For any $x$, there are infinitely many intervals on the list which contain $x$, and infinitely many which don't. It follows that $\limsup_n X_n = 1$, $\liminf_n X_n = 0$, i.e. that $X_n$ is nowhere convergent.

(b) Almost sure convergence does not imply convergence in $\mathcal{L}^p$: On the unit interval with Lebesgue measure, let $X_n = n I_{(0, (1/n)^p)}$. Then $X_n \to 0$ a.s., but $\mathbb{E}|X_n - 0|^p = 1$ for all $n \in \mathbb{N}$, so $X_n$ does not converge to 0 in $\mathcal{L}^p$ mean.

(c) Convergence in probability does not imply convergence in $\mathcal{L}^p$ or almost sure convergence: This follows from the previous two examples. For instance, (a) is an example of a sequence which converges in mean, and thus in probability, but not almost surely.

(d) Weak convergence does not imply convergence in probability: Let $B$ be a Bernoulli variable on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$, taking only the values 0 and 1, each with probability $\tfrac12$. Let $X_n = B$ for each $n$. Also put $X = 1 - B$. Then $X$ is also a Bernoulli variable, and also takes only the values 0 and 1, each with a probability of $\tfrac12$. Thus the $X_n, X$ are identically distributed, and thus certainly $X_n \Rightarrow X$ (because $\mathbb{E}f(X_n) = \mathbb{E}f(X) = \mathbb{E}f(B)$). Nevertheless, $|X_n - X| = 1$ for all $n$ and all $\omega \in \Omega$. Thus $X_n$ does not converge to $X$ in probability.

□ 
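The sequence of intervals in Remarks 8.1.11(a) is easy to generate, which makes both halves of the counterexample visible. The sketch below is an illustration with hypothetical helper names (`interval`, `X`); it lists the values $X_n(x)$ at one sample point and the interval lengths, which equal $\mathbb{E}|X_n|^p$ for every $p \geq 1$ because $X_n$ is an indicator.

```python
from fractions import Fraction


def interval(n):
    """Endpoints (a, b) of the n-th interval (n = 1, 2, ...) on the list
    [0,1], [0,1/2], [1/2,1], [0,1/3], ...; block m consists of m intervals of length 1/m."""
    m = 1
    while n > m:
        n -= m
        m += 1
    return Fraction(n - 1, m), Fraction(n, m)


def X(n, x):
    """Indicator of the n-th interval, evaluated at x."""
    a, b = interval(n)
    return 1 if a <= x <= b else 0


x = Fraction(2, 7)                        # an arbitrary sample point in [0, 1]
n_max = 210                               # the first 20 blocks: 1 + 2 + ... + 20 = 210
values = [X(n, x) for n in range(1, n_max + 1)]
lengths = [interval(n)[1] - interval(n)[0] for n in range(1, n_max + 1)]

# Within the 20th block alone, X_n(x) takes both the value 0 and the value 1,
# while E|X_n|^p = length of the n-th interval has shrunk to 1/20.
last_block = values[-20:]
print(sorted(set(last_block)), lengths[-1])
```

The same pattern repeats in every later block, which is why $X_n(x)$ keeps oscillating between 0 and 1 while the $\mathcal{L}^p$ norms tend to 0.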



8.1.2 Uniform Integrability 



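Definition 8.1.12 (a) A collection of random variables $\mathcal{X}$ is said to be uniformly integrable (UI for short) if and only if for every $\varepsilon > 0$ there is $K \in \mathbb{R}$ such that

$$\mathbb{E}[|X|; |X| > K] < \varepsilon \quad \text{for all } X \in \mathcal{X}$$

i.e. if and only if

$$\rho(x) = \sup_{X \in \mathcal{X}} \mathbb{E}[|X|; |X| > x] \quad \text{satisfies } \rho(x) \to 0 \text{ as } x \to \infty$$

(b) A collection of random variables $\mathcal{X}$ is said to be uniformly $\mathbb{P}$-continuous (u-$\mathbb{P}$-c for short) if and only if for every $\varepsilon > 0$ there is $\delta > 0$ such that whenever $\mathbb{P}(F) < \delta$, then

$$\mathbb{E}[|X|; F] < \varepsilon \quad \text{for all } X \in \mathcal{X}$$

i.e. if and only if

$$\sup_{X \in \mathcal{X}} \mathbb{E}[|X|; F] \to 0 \quad \text{as } \mathbb{P}(F) \to 0$$

We say that a discrete- or continuous-parameter stochastic process $X_t$ is UI if and only if the collection of its component random variables $\{X_t\}_t$ is UI.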






Remarks 8.1.13 1. Note that a UI family $\mathcal{X}$ is bounded in $\mathcal{L}^1$: Let $\varepsilon = 1$, and choose $K$ such that $\sup_{X \in \mathcal{X}} \mathbb{E}[|X|; |X| > K] < 1$. Then

$$\sup_{X \in \mathcal{X}} \mathbb{E}|X| \leq \sup_{X \in \mathcal{X}} \mathbb{E}[|X|; |X| \leq K] + \sup_{X \in \mathcal{X}} \mathbb{E}[|X|; |X| > K] \leq K + 1 < \infty$$

In particular, if $\mathcal{X}$ is UI, then every $X \in \mathcal{X}$ is integrable.

2. The name uniform $\mathbb{P}$-continuity has its origins in the following. Suppose that $\mathcal{X}$ is a u-$\mathbb{P}$-c collection of integrable random variables. Define, for each $X \in \mathcal{X}$, a measure $\mu_X$ by $\mu_X(A) = \int_A |X| \, d\mathbb{P} = \mathbb{E}[|X|; A]$. Then each $\mu_X$ is absolutely continuous with respect to $\mathbb{P}$. The definition of uniform $\mathbb{P}$-continuity states that if $\mathbb{P}(A) \to 0$, then $\mu_X(A) \to 0$ uniformly in $X$.

□ 

Before we discuss what UI actually means, and why it is useful, consider the following 
proposition. 

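Proposition 8.1.14 Suppose that $X \in \mathcal{L}^1(\Omega, \mathcal{F}, \mathbb{P})$. Then the singleton $\{X\}$ is both UI and u-$\mathbb{P}$-c.

Proof: We first show that $\{X\}$ is u-$\mathbb{P}$-c. If not, then there exists an $\varepsilon > 0$ and a sequence $F_n \in \mathcal{F}$ such that

$$\mathbb{P}(F_n) < 2^{-n} \quad \text{but} \quad \mathbb{E}[|X|; F_n] \geq \varepsilon$$

Let $F = \limsup_n F_n$. Then by the Reverse Fatou Lemma we have

$$\mathbb{E}[|X|; F] \geq \limsup_n \mathbb{E}[|X|; F_n] \geq \varepsilon$$

Yet, by the First Borel-Cantelli Lemma,

$$\mathbb{P}(F) = \mathbb{P}(F_n, \text{ i.o.}) = 0$$

(because $\sum_n \mathbb{P}(F_n) < \infty$). Contradiction.

Next, we show that $\{X\}$ is UI. Let $\varepsilon > 0$, and choose $\delta > 0$ such that $\mathbb{E}[|X|; F] < \varepsilon$ whenever $\mathbb{P}(F) < \delta$. By Markov's inequality

$$\mathbb{P}(|X| > K) \leq \frac{1}{K}\,\mathbb{E}|X|$$

and since $\mathbb{E}|X|$ is finite, we can make $\mathbb{P}(|X| > K) < \delta$ by taking $K > \mathbb{E}|X| / \delta$.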

H 

Exercises 8.1.15 1. $\mathcal{X}$ is UI if and only if $\{|X| : X \in \mathcal{X}\}$ is UI.

2. We've already noted that a UI family is bounded in $\mathcal{L}^1$. However, not every $\mathcal{L}^1$-bounded family is UI: Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the usual probability space on the unit interval $[0,1]$, together with Lebesgue measure. Let

$$F_n = [0, 1/n] \qquad X_n = n I_{F_n}$$

for $n \in \mathbb{N}$. Show that $\mathcal{X} = \{X_n : n \in \mathbb{N}\}$ is bounded in $\mathcal{L}^1$, but that $\mathcal{X}$ is not UI.
Of course, the family $\mathcal{X}$ is in some sense pathological, because $X_n \to 0$ a.s., but $\mathbb{E}X_n \not\to 0$.

3. Suppose that $\mathcal{X}$ is a family of random variables dominated by some $Y \in \mathcal{L}^1$ (i.e. $|X| \leq Y$ a.s. for all $X \in \mathcal{X}$). Show that $\mathcal{X}$ is UI.

4. We can generalize the previous exercise: Let $\mathcal{X}, \mathcal{Y}$ be two families of non-negative random variables. Further suppose that $\mathcal{Y}$ is UI, and that for every $X \in \mathcal{X}$ there is a $Y \in \mathcal{Y}$ such that $X \leq Y$. Show that $\mathcal{X}$ is UI as well.

5. Show that any finite family of integrable random variables is UI.

6. Show that if $\mathcal{X}, \mathcal{Y}$ are two UI families, then the families $\mathcal{X} \cup \mathcal{Y}$ and $\mathcal{X} + \mathcal{Y} = \{X + Y : X \in \mathcal{X}, Y \in \mathcal{Y}\}$ are also UI.

7. Show that if $\mathcal{X}$ is a family of identically distributed integrable random variables, then $\mathcal{X}$ is UI.

□ 
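For the family in Exercise 8.1.15(2) the relevant expectations have closed forms, so the failure of uniform integrability can be tabulated exactly. The following is a sketch rather than a solution of the exercise; the helper names `tail_expectation` and `sup_tail` are hypothetical, and the supremum is only taken over finitely many $n$.

```python
from fractions import Fraction


def expectation(n):
    """E|X_n| for X_n = n * 1_[0, 1/n] under Lebesgue measure on [0, 1]."""
    return n * Fraction(1, n)


def tail_expectation(n, K):
    """E[X_n; X_n > K]: X_n equals n on [0, 1/n] and 0 elsewhere,
    so the tail is the whole expectation when n > K and zero otherwise."""
    return expectation(n) if n > K else Fraction(0)


def sup_tail(K, n_max):
    """sup over n = 1, ..., n_max of E[X_n; X_n > K]."""
    return max(tail_expectation(n, K) for n in range(1, n_max + 1))


bounded_in_L1 = all(expectation(n) == 1 for n in range(1, 1001))
# However large K is, some X_n with n > K still carries its full mass above K.
tails = {K: sup_tail(K, 10 * K) for K in (1, 10, 100, 1000)}
print(bounded_in_L1, tails)
```

Every entry of `tails` equals 1, so $\sup_n \mathbb{E}[X_n; X_n > K]$ does not tend to 0 as $K \to \infty$, even though $\mathbb{E}|X_n| = 1$ for all $n$.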



Note that every UI family is bounded in $\mathcal{L}^1$ (by Remarks 8.1.13), but not every $\mathcal{L}^1$-bounded family is UI (by Exercise 8.1.15). Nevertheless, being UI is only just stronger than being bounded in $\mathcal{L}^1$, as the following proposition makes clear:



Proposition 8.1.16 If $1 < p < \infty$ and $\mathcal{X}$ is a family of random variables which is bounded in $\mathcal{L}^p$, then $\mathcal{X}$ is UI.

Proof: Choose a bound $B \in \mathbb{R}$ such that $\mathbb{E}|X|^p \leq B$ for all $X \in \mathcal{X}$, and let $K > 0$. If $x \geq K$, then clearly $x = x^{1-p} x^p \leq K^{1-p} x^p$, because $1 - p < 0$ implies $K^{1-p} \geq x^{1-p}$. Thus, putting $x = |X|$ and integrating, we get

$$\mathbb{E}[|X|; |X| > K] \leq K^{1-p}\,\mathbb{E}[|X|^p; |X| > K] \leq K^{1-p} B$$

for all $X \in \mathcal{X}$. Thus if $\varepsilon > 0$ and if we choose $K$ sufficiently large to ensure that $K^{1-p} B < \varepsilon$, then we will have $\mathbb{E}[|X|; |X| > K] < \varepsilon$ for all $X \in \mathcal{X}$.

H 

Uniform integrability and uniform P-continuity are closely related: 



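Proposition 8.1.17 A family $\mathcal{X}$ of random variables is UI if and only if it is u-$\mathbb{P}$-c and bounded in $\mathcal{L}^1$.

Proof: Suppose that $\mathcal{X}$ is UI. Then clearly $\mathcal{X}$ is $\mathcal{L}^1$-bounded. Let $\varepsilon > 0$ and choose $K > 0$ such that $\mathbb{E}[|X|; |X| > K] < \varepsilon/2$ for all $X \in \mathcal{X}$. Then if $F$ is a measurable set with $\mathbb{P}(F) < \frac{\varepsilon}{2K}$ we have

$$\mathbb{E}[|X|; F] = \mathbb{E}[|X|; F \cap \{|X| \leq K\}] + \mathbb{E}[|X|; F \cap \{|X| > K\}] \leq K\,\mathbb{P}(F) + \varepsilon/2 < \varepsilon$$

Hence $\mathcal{X}$ is u-$\mathbb{P}$-c.

Conversely, assume that $\mathcal{X}$ is u-$\mathbb{P}$-c and $\mathcal{L}^1$-bounded. Choose $N$ such that $\mathbb{E}|X| \leq N$ for all $X \in \mathcal{X}$. Now let $\varepsilon > 0$, and choose $\delta > 0$ such that $\mathbb{P}(F) < \delta$ implies $\mathbb{E}[|X|; F] < \varepsilon$ for all $X \in \mathcal{X}$. Also choose $K$ sufficiently large that $\frac{N}{K} < \delta$. Then $\mathbb{P}(|X| > K) \leq \frac{1}{K}\mathbb{E}|X| \leq N/K < \delta$, and thus $\mathbb{E}[|X|; |X| > K] < \varepsilon$ for all $X \in \mathcal{X}$, as required.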






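Next, we want to prove a theorem which shows that uniform integrability is just the right concept to prove $\mathcal{L}^p$-convergence. Before we prove it, however, we need the following lemma:

Lemma 8.1.18 Suppose that $X_n$ are non-negative random variables, and that $X_n \xrightarrow{\mathcal{D}} X_\infty$. Then $\{X_n\}_n$ is UI if and only if $\mathbb{E}X_n \to \mathbb{E}X_\infty$.

Proof: Recall that convergence in distribution is equivalent to weak convergence of the respective distributions, i.e. that $X_n \xrightarrow{\mathcal{D}} X_\infty$ if and only if $\mathbb{E}f(X_n) \to \mathbb{E}f(X_\infty)$ for every bounded continuous function $f : \mathbb{R} \to \mathbb{R}$.

First assume that $\mathbb{E}X_n \to \mathbb{E}X_\infty$. If $K > 0$, then $f(x) = x \wedge (K - x)^+$ is clearly bounded and continuous. Note also that $X_n - X_n \wedge (K - X_n)^+ \geq 0$, and that the random variables $X_n$ and $X_n - X_n \wedge (K - X_n)^+$ coincide on the set $\{X_n > K\}$. It therefore follows that

$$\mathbb{E}[X_n; X_n > K] = \mathbb{E}[X_n - X_n \wedge (K - X_n)^+; X_n > K] \leq \mathbb{E}[X_n - X_n \wedge (K - X_n)^+]$$

Because $X_n \xrightarrow{\mathcal{D}} X_\infty$, we have $\mathbb{E}[X_n \wedge (K - X_n)^+] \to \mathbb{E}[X_\infty \wedge (K - X_\infty)^+]$, and because in addition $\mathbb{E}X_n \to \mathbb{E}X_\infty$, we see that

$$\mathbb{E}[X_n - X_n \wedge (K - X_n)^+] \to \mathbb{E}[X_\infty - X_\infty \wedge (K - X_\infty)^+]$$

Further note that $X_\infty \wedge (K - X_\infty)^+ \uparrow X_\infty$ as $K \uparrow \infty$, so that $\mathbb{E}[X_\infty \wedge (K - X_\infty)^+] \uparrow \mathbb{E}X_\infty$, by the Monotone Convergence Theorem. It follows that $\mathbb{E}[X_\infty - X_\infty \wedge (K - X_\infty)^+] \to 0$ as $K \to \infty$.

We are now ready to prove that $\{X_n\}_n$ is UI: Let $\varepsilon > 0$. First choose $K$ sufficiently large so that $\mathbb{E}[X_\infty - X_\infty \wedge (K - X_\infty)^+] < \frac{\varepsilon}{2}$. Then choose $N$ sufficiently large so that $\mathbb{E}[X_n - X_n \wedge (K - X_n)^+] < \mathbb{E}[X_\infty - X_\infty \wedge (K - X_\infty)^+] + \frac{\varepsilon}{2}$ whenever $n \geq N$. In particular, $\mathbb{E}[X_n; X_n > K] < \mathbb{E}[X_\infty - X_\infty \wedge (K - X_\infty)^+] + \frac{\varepsilon}{2}$, and so

$$\mathbb{E}[X_n; X_n > K] < \frac{\varepsilon}{2} + \frac{\varepsilon}{2}$$

whenever $n \geq N$. By increasing $K$, if necessary, we can also ensure $\mathbb{E}[X_n; X_n > K] < \varepsilon$ for the finitely many $n < N$; this does not disturb the bound for $n \geq N$, because $\mathbb{E}[X_n; X_n > K]$ decreases as $K$ increases. It follows that $\{X_n\}_n$ is indeed UI.

Conversely, suppose that $X_n \xrightarrow{\mathcal{D}} X_\infty$ and that $\{X_n\}_n$ is UI. Then since $x \wedge K$ is a bounded and continuous function of $x$, we have $\mathbb{E}[X_n] \geq \mathbb{E}[X_n \wedge K] \to \mathbb{E}[X_\infty \wedge K]$ for all $K > 0$, so that $\liminf_n \mathbb{E}[X_n] \geq \mathbb{E}[X_\infty \wedge K]$. By the Monotone Convergence Theorem, $\mathbb{E}[X_\infty \wedge K] \to \mathbb{E}X_\infty$ as $K \to \infty$, which proves that $\mathbb{E}X_\infty \leq \liminf_n \mathbb{E}X_n$ (see footnote 1). In particular, if $\{X_n\}_n$ is UI, it is $\mathcal{L}^1$-bounded, and so $\mathbb{E}X_\infty < \infty$. Now note that

$$|\mathbb{E}X_n - \mathbb{E}X_\infty| \leq |\mathbb{E}[X_n] - \mathbb{E}[X_n \wedge K]| + |\mathbb{E}[X_n \wedge K] - \mathbb{E}[X_\infty \wedge K]| + |\mathbb{E}[X_\infty \wedge K] - \mathbb{E}X_\infty|$$

As $n \to \infty$, the middle term goes to zero, whereas as $K \to \infty$, the outside terms go to zero (uniformly in $n$, by UI). It follows that by taking $n$ and $K$ sufficiently large, $|\mathbb{E}X_n - \mathbb{E}X_\infty|$ can be made arbitrarily small, so that $\mathbb{E}X_n \to \mathbb{E}X_\infty$.

Footnote 1: We have therefore proved a version of Fatou's Lemma for convergence in distribution: If $X_n \geq 0$ and if $X_n \xrightarrow{\mathcal{D}} X_\infty$, then $\mathbb{E}X_\infty \leq \liminf_n \mathbb{E}X_n$. Note that this does not require the family $\{X_n\}_n$ to be UI.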






□ 

Note the following useful fact about convergence in probability: If $X_n \xrightarrow{\mathbb{P}} X_\infty$ and $f : \mathbb{R} \to \mathbb{R}$ is continuous, then $f(X_n) \xrightarrow{\mathbb{P}} f(X_\infty)$. To see this, recall that a sequence of random variables converges in probability if and only if every subsequence has a further subsequence which converges almost surely. But given a subsequence $(f(X_{n_k}))_k$ of $(f(X_n))_n$, we can choose a further subsequence $(X_{n_{k_j}})_j$ of $(X_{n_k})_k$ such that $X_{n_{k_j}} \to X_\infty$ a.s. (as $j \to \infty$). By continuity of $f$, also $f(X_{n_{k_j}}) \to f(X_\infty)$ a.s. as $j \to \infty$. And this shows $f(X_n) \xrightarrow{\mathbb{P}} f(X_\infty)$.
We are now ready to state and prove an extension of the Dominated Convergence Theorem:



Theorem 8.1.19 Let $p \in [1, \infty)$, and let $\{X_n : n \in \mathbb{N}\}$ be a sequence of random variables in $\mathcal{L}^p$ which converges in probability to some random variable $X_\infty$. Then the following statements are equivalent:

(a) $X_n \to X_\infty$ in $\mathcal{L}^p$;

(b) $\lim_{n \to \infty} \mathbb{E}(|X_n|^p) = \mathbb{E}(|X_\infty|^p)$;

(c) $\{|X_n|^p : n \in \mathbb{N}\}$ is UI;

(d) $\{|X_n|^p : n \in \mathbb{N}\}$ is uniformly $\mathbb{P}$-continuous.

Furthermore if either of the following conditions are satisfied, then (a)-(d) hold:

(e) $\sup_n \mathbb{E}(|X_n|^q) < \infty$ for some $p < q < \infty$ (i.e. $X_n$ is bounded in $\mathcal{L}^q$);

(f) There is $Y \in \mathcal{L}^p$ such that $|X_n| \leq Y$ for all $n \in \mathbb{N}$.



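Proof: (a) $\Rightarrow$ (b): This follows from the rather easy fact that if $x_n \to x$ in some normed
space, then $\|x_n\| \to \|x\|$ in $\mathbb{R}$.

(b) $\Rightarrow$ (c): If $X_n \xrightarrow{P} X_\infty$, then also $|X_n|^p \xrightarrow{P} |X_\infty|^p$, and thus $|X_n|^p \xrightarrow{\mathcal{D}} |X_\infty|^p$. Thus if
$\mathbb{E}|X_n|^p \to \mathbb{E}|X_\infty|^p$, then $\{|X_n|^p\}_n$ is UI, by Lemma 8.1.18.

(c) $\Rightarrow$ (d): If $\{|X_n|^p\}_n$ is UI, then it is u-P-c, by Proposition 8.1.16.

(d) $\Rightarrow$ (a): Suppose now that $X_n \xrightarrow{P} X_\infty$ and that $\{|X_n|^p\}_n$ is u-P-c. Let $\varepsilon > 0$ and choose
$\delta > 0$ such that $\mathbb{P}(A) < \delta$ implies both $\mathbb{E}[|X_n|^p; A] < \varepsilon$ (all $n \in \mathbb{N}$) and $\mathbb{E}[|X_\infty|^p; A] < \varepsilon$. Now
choose $N \in \mathbb{N}$ such that $n \ge N$ implies $\mathbb{P}(|X_n - X_\infty| > \varepsilon^{1/p}) < \delta$.
Further note that

$$|X_n - X_\infty|^p \le (2\max\{|X_n|, |X_\infty|\})^p \le 2^p(|X_n|^p + |X_\infty|^p)$$

Thus $n \ge N$ implies

$$\mathbb{E}|X_n - X_\infty|^p = \mathbb{E}[|X_n - X_\infty|^p; |X_n - X_\infty|^p \le \varepsilon] + \mathbb{E}[|X_n - X_\infty|^p; |X_n - X_\infty|^p > \varepsilon]$$
$$\le \varepsilon + 2^p\, \mathbb{E}[|X_n|^p + |X_\infty|^p; |X_n - X_\infty|^p > \varepsilon]$$
$$\le \varepsilon(1 + 2^p + 2^p)$$

Since $\varepsilon > 0$ was arbitrary and $p$ is fixed, $\mathbb{E}|X_n - X_\infty|^p \to 0$ as $n \to \infty$. Hence $X_n \xrightarrow{\mathcal{L}^p} X_\infty$.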

(e) $\Rightarrow$ (c): If $\{X_n\}_n$ is bounded in $\mathcal{L}^q$ for some $q > p$, then $\{|X_n|^p\}_n$ is bounded in $\mathcal{L}^{q/p}$. Now
apply Proposition 8.1.16.






(f) $\Rightarrow$ (c): If $|X_n| \le Y$ and $Y \in \mathcal{L}^p$, then $|X_n|^p \le Y^p$ and $Y^p \in \mathcal{L}^1$. Now apply Exercise
8.1.15(3).
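
The role of uniform integrability in Theorem 8.1.19 can be seen on a classical example which is not taken from the text: on $([0,1], \lambda)$ let $X_n = n \cdot I_{[0,1/n]}$. Then $X_n \to 0$ in probability, yet $\mathbb{E}X_n = 1$ for all $n$, so (a), (b) and (c) all fail for $p = 1$. The Python sketch below evaluates the relevant quantities exactly; the function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Assumed example (not from the text): X_n = n * 1_[0, 1/n] on ([0,1], Lebesgue).

def prob_exceeds(n, eps):
    """P(X_n > eps) for eps > 0: the set [0, 1/n] has measure 1/n when n > eps."""
    return Fraction(1, n) if n > eps else Fraction(0)

def expectation(n):
    """E[X_n] = n * (1/n)."""
    return n * Fraction(1, n)

def tail_expectation(n, K):
    """E[X_n; X_n > K]: all of the mass of X_n sits at height n."""
    return expectation(n) if n > K else Fraction(0)

ns = range(1, 1001)

# Convergence in probability to 0: P(X_n > 1/2) = 1/n -> 0.
print([prob_exceeds(n, Fraction(1, 2)) for n in (1, 10, 100, 1000)])

# But E[X_n] does not converge to E[0] = 0.
print({expectation(n) for n in ns})

# And the family is not UI: the supremum of the tail expectations stays at 1
# for every cut-off K that is smaller than the largest n considered.
for K in (1, 10, 100):
    print(K, max(tail_expectation(n, K) for n in ns))
```

The supremum over $n \le 1000$ is only a finite stand-in for the supremum over all $n$, but for this family the true supremum equals 1 for every $K$.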



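Proposition 8.1.20 Suppose that $X_n, X_\infty \in \mathcal{L}^p$ (where $1 \le p < \infty$).

(a) If $X_n \xrightarrow{\mathcal{L}^p} X_\infty$, then $|X_n|^p \xrightarrow{\mathcal{L}^1} |X_\infty|^p$.

(b) Conversely, if the $X_n$ are non-negative, then $X_n^p \xrightarrow{\mathcal{L}^1} X_\infty^p$ implies $X_n \xrightarrow{\mathcal{L}^p} X_\infty$.

Proof: If $X_n \xrightarrow{\mathcal{L}^p} X_\infty$, then $\{|X_n|^p\}_n$ is u-P-c and $|X_n|^p \xrightarrow{P} |X_\infty|^p$. By Theorem 8.1.19
we thus have $|X_n|^p \xrightarrow{\mathcal{L}^1} |X_\infty|^p$.

Conversely, if $X_n^p \xrightarrow{\mathcal{L}^1} X_\infty^p$, then $X_n^p \xrightarrow{P} X_\infty^p$ and $\{X_n^p\}_n$ is u-P-c. If the $X_n$ are non-negative,
we can conclude that $X_n \xrightarrow{P} X_\infty$, and then Theorem 8.1.19 guarantees that $X_n \xrightarrow{\mathcal{L}^p} X_\infty$.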



8.2 Weak Convergence and Convergence in Distribution 

We will scratch only the surface of the theory of weak convergence of probability measures, also known as
convergence in distribution, and restrict ourselves entirely to measures on
$(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Define

$$\mathcal{M}_1(\mathbb{R}) := \text{set of all probability measures on } \mathbb{R}$$

and

$$C_b(\mathbb{R}) := \text{set of all bounded continuous functions } \mathbb{R} \to \mathbb{R}$$

Recall also that if $X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is a random variable, then the law (or distribution)
of $X$ is a probability measure $\mu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined by

$$\mu_X(B) := \mathbb{P}X^{-1}(B) = \mathbb{P}(X \in B)$$



The definition of weak convergence uses both measure-theoretic and topological proper- 
ties: 



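Definition 8.2.1 (a) Let $\mu_n, \mu \in \mathcal{M}_1(\mathbb{R})$ for $n \in \mathbb{N}$. We say that $(\mu_n)_n$ converges weakly
to $\mu$, and write $\mu_n \xrightarrow{w} \mu$, iff for each $f \in C_b(\mathbb{R})$ we have

$$\lim_{n\to\infty} \int f \, d\mu_n = \int f \, d\mu$$

(b) If $X_n, X$ are real-valued random variables, we say that $(X_n)_n$ converges in distribution
to $X$, and write $X_n \xrightarrow{\mathcal{D}} X$, if and only if $\mu_{X_n} \xrightarrow{w} \mu_X$.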








Remarks 8.2.2 Note that, in the definition of convergence in distribution of random variables, the $X_n, X$
are not required to be defined on the same space: it is only their laws that matter. Nevertheless, if the $X_n, X$
are all defined on the same probability space $(\Omega, \mathcal{F}, \mathbb{P})$, then, seeing that $\int f \, d\mu_X = \mathbb{E}[f(X)]$ by the change of
variables formula for integrals, we have

$$X_n \xrightarrow{\mathcal{D}} X \quad\text{iff}\quad \mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)] \text{ for all } f \in C_b(\mathbb{R})$$

□ 

Exercise 8.2.3 (a) Show that if $X_n \xrightarrow{a.s.} X$, then $X_n \xrightarrow{\mathcal{D}} X$.
(b) Show that if $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{\mathcal{D}} X$.

□ 

In elementary (non-measure-theoretic) probability texts, the definition of convergence in
distribution is often given in terms of distribution functions. Recall that a function $F : \mathbb{R} \to [0,1]$
is a distribution function provided that

(i) $F$ is increasing, i.e. $x \le y \Rightarrow F(x) \le F(y)$.

(ii) $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to+\infty} F(x) = 1$.

(iii) $F$ is right-continuous, i.e. $\lim_{y \downarrow x} F(y) = F(x)$ for all $x \in \mathbb{R}$.

As $F$ is increasing, $F(x-) := \lim_{y \uparrow x} F(y)$ exists. $F$ is discontinuous at $x$ precisely when
$F(x) \ne F(x-)$. We shall call a point $x \in \mathbb{R}$ where a distribution function $F$ is discontinuous
an atom of the distribution function. If $F$ is the distribution function of some random variable
$X$, then $x$ is an atom of $F$ if and only if $\mathbb{P}(X = x) = F(x) - F(x-) > 0$.

Recall also that every probability measure $\mu$ on $\mathbb{R}$ induces a distribution function: Simply
define $F(x) = \mu(-\infty, x]$. The converse is also true: Any distribution function $F$ yields a
unique probability measure $\mu$ on $(\mathbb{R}, \mathcal{B})$, defined by $\mu(-\infty, x] = F(x)$ and then
extended to all Borel sets by Carathéodory's extension theorem.

Definition 8.2.4 (Non-measure-theoretic definition of convergence in distribution)
Let $F_{X_n}, F_X$ be the distribution functions of random variables $X_n, X$. We say that $X_n$
converges to $X$ in distribution, written $X_n \xrightarrow{\mathcal{D}} X$, provided that

$$F_{X_n}(x) \to F_X(x) \quad\text{as } n \to +\infty$$

at every point $x$ where $F_X$ is continuous.

Example 8.2.5 Consider constant random variables $X_n := \frac{1}{n}$ on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The
law of $X_n$ is just the Dirac delta $\delta_{1/n}$, i.e. the point mass at $\frac{1}{n}$. Clearly, we ought to have $\delta_{1/n} \to \delta_0$, as the
points $\frac{1}{n}$ lie closer and closer to the point 0. Nevertheless, we do not have $\delta_{1/n}(B) \to \delta_0(B)$ for every Borel set:
For example

$$\delta_{1/n}\{0\} \not\to \delta_0\{0\}$$

as $0 \ne 1$. Similarly, we do not have

$$\int f(x)\, \delta_{1/n}(dx) \to \int f(x)\, \delta_0(dx) \quad\text{for all measurable functions } f$$

take $f = I_{\{0\}}$ or $f = I_{(-\infty,0]}$, for example.

However, the notion of $\frac{1}{n}$ lying "closer and closer" to 0 is a topological notion. If $f$ respects topology, i.e.
if $f$ preserves the "closeness"-relation, i.e. if $f$ is continuous, then we have $\lim_n f(\frac{1}{n}) = f(0)$, i.e.

$$\lim_n \int f(x)\, \delta_{1/n}(dx) = \lim_n f(\tfrac{1}{n}) = f(0) = \int f(x)\, \delta_0(dx)$$






□ 
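
The behaviour of the point masses at $1/n$ can be checked numerically. The following Python sketch (the helper names are hypothetical) evaluates the distribution functions at the atom 0 and at a continuity point of the limit, and integrates one bounded continuous test function chosen for illustration.

```python
def cdf_point_mass(a, x):
    """Distribution function of the point mass at a: F(x) = 1 if x >= a, else 0."""
    return 1.0 if x >= a else 0.0

def integral_against_point_mass(f, a):
    """The integral of f with respect to the point mass at a is just f(a)."""
    return f(a)

ns = [1, 10, 100, 1000, 10**6]

# At the atom x = 0 of the limit law, F_n(0) = 0 for every n, although F(0) = 1.
values_at_atom = [cdf_point_mass(1.0 / n, 0.0) for n in ns]

# At a continuity point of F, e.g. x = 0.05, F_n(x) eventually equals F(x) = 1.
values_at_continuity_point = [cdf_point_mass(1.0 / n, 0.05) for n in ns]

# For a bounded continuous f (an arbitrary choice) the integrals converge to f(0).
def f(x):
    return 1.0 / (1.0 + x * x)

integrals = [integral_against_point_mass(f, 1.0 / n) for n in ns]

print(values_at_atom)
print(values_at_continuity_point)
print(integrals, "limit:", f(0.0))
```

Convergence fails only at $x = 0$, which is exactly the point excluded by Definition 8.2.4.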

Lemma 8.2.6 A distribution function can have at most countably many points where it is
discontinuous.

Proof: Let $D := \{x \in \mathbb{R} : F(x) - F(x-) > 0\}$ be the set of all points where $F$ is discontinuous,
and, for $n \in \mathbb{N}$, define $D_n = \{x \in \mathbb{R} : F(x) - F(x-) > \frac{1}{n}\}$. Clearly, $D = \bigcup_n D_n$. But since
$0 \le F(x) \le 1$ for all $x$, and since $F$ is increasing, each $D_n$ can have at most $n$ elements.
Hence $D$ is a countable union of finite sets, and thus countable.

⊣



Proposition 8.2.7 Let $F_n, F$ be distribution functions on $\mathbb{R}$ for $n \in \mathbb{N}$ (in the sense
defined above), and suppose that $F_n(x) \to F(x)$ at every point $x$ where $F$ is continuous. Then
there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ carrying random variables $X_n, X$ (for $n \in \mathbb{N}$)
such that

$$F_n = F_{X_n}, \quad F = F_X \quad\text{and}\quad X_n \xrightarrow{a.s.} X$$

Proof: Let $(\Omega, \mathcal{F}, \mathbb{P})$ be $([0,1], \mathcal{B}[0,1], \lambda)$ and define

$$X^-(\omega) = \inf\{y \in \mathbb{R} : F(y) \ge \omega\} = \sup\{y \in \mathbb{R} : F(y) < \omega\}$$
$$X^+(\omega) = \inf\{y \in \mathbb{R} : F(y) > \omega\} = \sup\{y \in \mathbb{R} : F(y) \le \omega\}$$

and define $X_n^+(\omega)$ and $X_n^-(\omega)$ similarly (with $F_n$ in place of $F$).

Note that we always have $X^-(\omega) \le X^+(\omega)$. If $\omega \in [0,1]$, then $[X^-(\omega), X^+(\omega)]$ is a closed
subinterval of $\mathbb{R}$, possibly degenerate, i.e. consisting of just a single point. When the interval
$[X^-(\omega), X^+(\omega)]$ is non-degenerate, then $F$ is constant with value $\omega$ on the interior of that interval. Clearly,
therefore, if $[X^-(\omega_1), X^+(\omega_1)] = [X^-(\omega_2), X^+(\omega_2)]$ is non-degenerate, then $\omega_1 = \omega_2$. As each
non-degenerate interval must contain a rational number, there are at most countably many
$\omega$ for which $[X^-(\omega), X^+(\omega)]$ is non-degenerate, i.e. at most countably many $\omega \in [0,1]$ for
which $X^+(\omega) \ne X^-(\omega)$. Thus $X^+ = X^-$ $\lambda$-a.s.

Note that by right-continuity of $F$, we have

$$\omega \le F(x) \iff X^-(\omega) \le x$$

and thus

$$\mathbb{P}(X^- \le x) = \lambda\{\omega \in [0,1] : X^-(\omega) \le x\} = \lambda[0, F(x)] = F(x)$$

This shows that the distribution function of $X^-$ is $F$. As $X^+ = X^-$ a.s., $X^+$ has the same
distribution as $X^-$, i.e. the distribution function of $X^+$ is also $F$.

In the same way it follows that the random variables $X_n^+$ and $X_n^-$ have distribution
functions $F_n$.

For definiteness, define $X := X^+$. We now show that $X_n^+ \xrightarrow{a.s.} X$ and that $X_n^- \xrightarrow{a.s.} X$. For
let $\omega \in \Omega$, and let $y$ be a point of continuity of $F$ such that $y > X^+(\omega)$. Then $F(y) > \omega$, and
hence, for all sufficiently large $n$, also $F_n(y) > \omega$ (because $F_n(y) \to F(y)$). Hence $y \ge X_n^+(\omega)$
for all sufficiently large $n$, so that $\limsup_n X_n^+(\omega) \le y$. Now since there are at most countably
many points where $F$ is not continuous, we may let $y$ decrease to $X^+(\omega)$ through continuity points of $F$, and so we must have

$$\limsup_n X_n^+(\omega) \le X^+(\omega)$$






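In the same way, we can prove that

$$\liminf_n X_n^-(\omega) \ge X^-(\omega)$$

Putting these inequalities together, and using $X_n^- \le X_n^+$, we see that

$$X^-(\omega) \le \liminf_n X_n^-(\omega) \le \limsup_n X_n^+(\omega) \le X^+(\omega)$$

Now since $X^- = X^+$ a.s., we must have $X_n^- \xrightarrow{a.s.} X$ and $X_n^+ \xrightarrow{a.s.} X$.

⊣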
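
The construction $X^-(\omega) = \inf\{y : F(y) \ge \omega\}$ on $([0,1], \lambda)$ is what is usually called inverse-transform sampling. The Python sketch below illustrates it for a small discrete distribution chosen purely as an assumption for the example (atoms 0, 1, 3 with probabilities 0.2, 0.5, 0.3); the names `F` and `x_minus` are hypothetical. It checks the equivalence $\omega \le F(x) \iff X^-(\omega) \le x$ on a grid and compares empirical frequencies with the target probabilities.

```python
import random
from bisect import bisect_left

# Assumed example distribution: P(X=0)=0.2, P(X=1)=0.5, P(X=3)=0.3.
support = [0.0, 1.0, 3.0]
cum_probs = [0.2, 0.7, 1.0]   # values of F at the atoms

def F(x):
    """Distribution function: the cumulative probability of the largest atom <= x."""
    value = 0.0
    for s, c in zip(support, cum_probs):
        if s <= x:
            value = c
    return value

def x_minus(omega):
    """X^-(omega) = inf{y : F(y) >= omega}; the first atom whose cumulative mass reaches omega."""
    return support[bisect_left(cum_probs, omega)]

# The key equivalence from the proof, checked on a grid of omegas in (0, 1].
equivalence_holds = all(
    (omega <= F(x)) == (x_minus(omega) <= x)
    for omega in [k / 20 for k in range(1, 21)]
    for x in [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
)

# Feeding uniform draws from [0, 1) through X^- produces samples with law F.
random.seed(0)
n = 100_000
samples = [x_minus(random.random()) for _ in range(n)]
empirical = {s: samples.count(s) / n for s in support}

print(equivalence_holds)
print(empirical)
```

The empirical frequencies should be close to 0.2, 0.5 and 0.3, up to sampling error of order $n^{-1/2}$.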

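Theorem 8.2.8 Suppose that $\mu_n, \mu$ are probability distributions on $\mathbb{R}$ and that $F_n, F$ are
the associated distribution functions. Then $\mu_n \xrightarrow{w} \mu$ if and only if $F_n(x) \to F(x)$ at every
point $x \in \mathbb{R}$ where $F$ is continuous.

Proof: First suppose that $\mu_n \xrightarrow{w} \mu$. Let $x \in \mathbb{R}$ and let $\delta > 0$. Define a bounded continuous
function $f$ by

$$f(y) = \begin{cases} 1 & \text{if } y \le x \\ 1 - \delta^{-1}(y - x) & \text{if } x < y < x + \delta \\ 0 & \text{if } y \ge x + \delta \end{cases}$$

Thus if $\delta$ is small, then $f$ is a continuous approximation of a step function which jumps from
1 to 0 at $x$.

Note that $I_{(-\infty,x]} \le f \le I_{(-\infty,x+\delta]}$. Since $\mu_n \xrightarrow{w} \mu$, we must have

$$\limsup_n F_n(x) = \limsup_n \int I_{(-\infty,x]} \, d\mu_n \le \limsup_n \int f \, d\mu_n = \int f \, d\mu \le \int I_{(-\infty,x+\delta]} \, d\mu = F(x+\delta)$$

Using right continuity of $F$, we see, upon letting $\delta \downarrow 0$, that $\limsup_n F_n(x) \le F(x)$ for all
$x \in \mathbb{R}$.

Similarly, define

$$g(y) = \begin{cases} 1 & \text{if } y \le x - \delta \\ 1 - \delta^{-1}(y - (x - \delta)) & \text{if } x - \delta < y < x \\ 0 & \text{if } y \ge x \end{cases}$$

so that $I_{(-\infty,x-\delta]} \le g \le I_{(-\infty,x)}$. Then

$$F_n(x-) = \mu_n(-\infty, x) \ge \int g \, d\mu_n$$

Since $\mu_n \xrightarrow{w} \mu$, we must have

$$\liminf_n F_n(x-) \ge \liminf_n \int g \, d\mu_n = \int g \, d\mu \ge F(x - \delta)$$

Letting $\delta \downarrow 0$, we get

$$\liminf_n F_n(x-) \ge F(x-)$$

for all $x \in \mathbb{R}$.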






It now follows that

$$F(x-) \le \liminf_n F_n(x-) \le \limsup_n F_n(x) \le F(x)$$

In particular, if $x$ is a point of continuity of $F$ (i.e. if $F(x-) = F(x)$), then

$$\lim_n F_n(x) = F(x)$$

as required. This proves the forward direction.

Now assume that $F_n(x)$ converges to $F(x)$ at every continuity point of $F$. By Proposition 8.2.7, there is a
probability space $(\Omega, \mathcal{F}, \mathbb{P})$ which carries random variables $X_n, X$ with the properties that

$$F_{X_n} = F_n, \quad F_X = F \quad\text{and}\quad X_n \xrightarrow{a.s.} X$$

If $f \in C_b(\mathbb{R})$, bounded by a constant $K$, then the $f(X_n)$ are random variables which are
bounded by $K$ as well. By the Lebesgue Dominated Convergence Theorem, we therefore have

$$\int f \, d\mu_n = \int f(X_n) \, d\mathbb{P} \to \int f(X) \, d\mathbb{P} = \int f \, d\mu$$

Thus $\mu_n \xrightarrow{w} \mu$, proving the reverse direction.

□

We now seek a kind of compactness condition for probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.



Definition 8.2.9 A sequence $(\mu_n)_n$ of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is said to be
tight if and only if

$$\sup_n \mu_n\{x : |x| > K\} \to 0 \quad\text{as } K \to \infty$$

A sequence of distribution functions $(F_n)$ on $\mathbb{R}$ is tight if and only if the corresponding
probability distributions form a tight sequence.

Remarks 8.2.10 (a) You should verify the following directly from the definition: $(\mu_n)_n$ is tight if and
only if for every $\varepsilon > 0$ there exists a $K > 0$ such that

$$\mu_n[-K, K] > 1 - \varepsilon \quad\text{for all } n \in \mathbb{N}$$

In other words, most of the mass of each $\mu_n$ lies on a single compact interval $[-K, K]$, which is the same
for all $\mu_n$.

(b) It should be clear that a single probability distribution $\mu$ on $\mathbb{R}$ is tight, i.e. that, for any $\varepsilon > 0$ there is $K$
such that $\mu\{x : |x| > K\} < \varepsilon$. Indeed, since $[-n, n] \uparrow \mathbb{R}$, we have $\mu[-n, n] \uparrow 1$, and thus there is $K$ such
that $\mu[-K, K] > 1 - \varepsilon$.

In the same way, it can be shown that any finite set $\{\mu_1, \dots, \mu_n\}$ is tight: Choose $K_j$ so that $\mu_j[-K_j, K_j] > 1 - \varepsilon$,
and then define $K := \max\{K_1, \dots, K_n\}$. Clearly $\mu_j[-K, K] \ge \mu_j[-K_j, K_j] > 1 - \varepsilon$ for all $j = 1, \dots, n$.

(c) If $\mu_n$ is the distribution of a $N(m_n, 1)$-normal random variable, and the sequence of means $(m_n)$ is
bounded, then $(\mu_n)_n$ is tight.

(d) If $\mu_n$ is the distribution of a $N(0, n)$-normal random variable, then $(\mu_n)_n$ is not tight.

(e) If $\mu_n = \delta_n$, then $(\mu_n)_n$ is not tight.

□ 
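
Remarks 8.2.10(c)-(e) can be explored numerically. The Python sketch below computes $\mu_n\{x : |x| > K\}$ for the three families and takes the maximum over a finite set of indices $n$, which is only a stand-in for the supremum over all $n$. The bounded means $m_n = 2(-1)^n$ and the index set are assumptions made for this illustration, and the function names are hypothetical.

```python
import math

def normal_tail_mass(mean, variance, K):
    """mu{x : |x| > K} for mu = N(mean, variance), via the complementary error function."""
    s = math.sqrt(2.0 * variance)
    upper = 0.5 * math.erfc((K - mean) / s)   # mass of (K, +inf)
    lower = 0.5 * math.erfc((K + mean) / s)   # mass of (-inf, -K)
    return upper + lower

def point_mass_tail(n, K):
    """delta_n{x : |x| > K}: all mass sits at the point n."""
    return 1.0 if abs(n) > K else 0.0

ns = [10**k for k in range(9)]   # finite sample of indices n = 1, 10, ..., 10^8

def sup_bounded_means(K):
    # (c): N(m_n, 1) with the bounded means m_n = 2 * (-1)^n (an assumed choice)
    return max(normal_tail_mass(2.0 * (-1) ** n, 1.0, K) for n in ns)

def sup_growing_variance(K):
    # (d): N(0, n)
    return max(normal_tail_mass(0.0, float(n), K) for n in ns)

def sup_point_masses(K):
    # (e): delta_n
    return max(point_mass_tail(n, K) for n in ns)

for K in (1.0, 10.0, 100.0):
    print(K, sup_bounded_means(K), sup_growing_variance(K), sup_point_masses(K))
```

For the family in (c) the largest tail mass collapses towards 0 as $K$ grows, while for (d) and (e) it stays close to 1 for every $K$ shown, as the remarks predict.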






Theorem 8.2.11 (Helly-Bray)

(a) Let $(F_n)$ be a sequence of distribution functions on $\mathbb{R}$. Then there exists a right-continuous
non-decreasing function $F : \mathbb{R} \to [0,1]$ and a subsequence $(F_{n_k})$ such that

$$\lim_{k\to\infty} F_{n_k}(x) = F(x)$$

at every point of continuity of $F$.

(b) If, moreover, the $(F_n)_n$ are tight, then $F$ is a distribution function, and $F_{n_k} \xrightarrow{w} F$.

Proof: (a) Enumerate the rationals (or any other countable dense subset of $\mathbb{R}$), i.e. $\mathbb{Q} = \{q_n : n \in \mathbb{N}\}$.
Since the sequence

$$(F_n(q_1))_{n \in \mathbb{N}}$$

is bounded (because it lies in $[0,1]$) it must have a convergent subsequence, by the Bolzano-Weierstrass
theorem, i.e. there exists a sequence $(n^{(1)}_k)_k$ in $\mathbb{N}$ such that $F_{n^{(1)}_k}(q_1)$ converges to
some value $\tilde{F}(q_1)$ as $k \to +\infty$. Since $F_{n^{(1)}_k}(q_2)$ is bounded, it has a convergent subsequence,
i.e. there exists a subsequence $(n^{(2)}_k)_k$ of $(n^{(1)}_k)_k$ such that $F_{n^{(2)}_k}(q_2)$ converges to some value $\tilde{F}(q_2)$ as
$k \to +\infty$. Since a subsequence of a convergent sequence converges, and to the same value, we
also have $F_{n^{(2)}_k}(q_1) \to \tilde{F}(q_1)$. Keep going in this way: Since $F_{n^{(2)}_k}(q_3)$ is bounded, there is a
subsequence $(n^{(3)}_k)_k$ of $(n^{(2)}_k)_k$ such that $F_{n^{(3)}_k}(q_3)$ converges to some value $\tilde{F}(q_3)$. Then, since
$(n^{(3)}_k)_k$ is a subsequence of $(n^{(2)}_k)_k$ and thus also of $(n^{(1)}_k)_k$, we must also have $F_{n^{(3)}_k}(q_2) \to \tilde{F}(q_2)$
and $F_{n^{(3)}_k}(q_1) \to \tilde{F}(q_1)$.

We now pick the diagonal of the sequences $n^{(i)}_k$: Define $n_k := n^{(k)}_k$. Note that the $k$-th tail
$\{n_m : m \ge k\}$ of $(n_k)_k$ is a subsequence of $(n^{(i)}_j)_j$ for each $i \le k$. It follows that

$$F_{n_k}(q_i) \to \tilde{F}(q_i) \quad\text{for all } q_i \in \mathbb{Q}$$

Clearly $\tilde{F}$ is an increasing function $\mathbb{Q} \to [0,1]$, but it hasn't been defined for all $x \in \mathbb{R}$. For
every real number $x$, however, we can find a strictly decreasing sequence of rationals $q \downarrow x$.
Thus define

$$F(x) = \lim_{q \downarrow x} \tilde{F}(q)$$

where $q$ strictly decreases to $x$.

Then $F$ has all the required properties. (Note, however, that $F(q)$ and $\tilde{F}(q)$ may not be
equal.) $F$ is clearly increasing. To see that it is right-continuous, let $x \in \mathbb{R}$ and $\varepsilon > 0$. Choose
$q \in \mathbb{Q}$ such that $x < q$, and $\tilde{F}(q) < F(x) + \varepsilon$. This can be done by definition of $F$. Then for every $y$ with $x < y < q$,

$$F(x) \le F(y) \le \tilde{F}(q) < F(x) + \varepsilon$$

which proves right-continuity.

Finally, suppose that $F$ is continuous at $x$. If $\varepsilon > 0$, we may choose $y < x$ such that
$F(x) - \varepsilon < F(y)$ (by continuity). We may also pick $q, r \in \mathbb{Q}$ such that $y < r < x < q$ and
$\tilde{F}(q) < F(x) + \varepsilon$. Since

$$F(x) - \varepsilon < \tilde{F}(r) \le \tilde{F}(q) < F(x) + \varepsilon$$






and

$$F_n(r) \le F_n(x) \le F_n(q)$$

it follows that

$$F(x) - \varepsilon < \liminf_k F_{n_k}(x) \le \limsup_k F_{n_k}(x) < F(x) + \varepsilon$$

Thus $F_{n_k}(x) \to F(x)$ at every point $x$ where $F$ is continuous.

(b) We need only show that the function $F$ obtained in part (a) exhibits the
correct end behaviour, since we already know that it is non-decreasing and right-continuous.

But this is easy: For example, to show that $\lim_{x \to +\infty} F(x) = 1$, proceed as follows. Let
$\varepsilon > 0$. By tightness, we can find a $K > 0$ such that $F_n(K) - F_n(-K) > 1 - \varepsilon$ for all $n$. Moreover,
$K$ can be taken to be a point of continuity of $F$, because $F$ has at most countably many
atoms. Then $F_n(K) > 1 - \varepsilon$ for all $n$. Thus if $x \ge K$, then

$$F(x) \ge F(K) = \lim_{k \to +\infty} F_{n_k}(K) \ge 1 - \varepsilon$$

⊣

Translating from the language of distribution functions to that of distributions, we obtain

Corollary 8.2.12 If $(\mu_n)_n$ is a tight sequence of probability distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
then it has a weakly convergent subsequence, i.e. there is a probability distribution $\mu$ and
a subsequence $(\mu_{n_k})_k$ so that $\mu_{n_k} \xrightarrow{w} \mu$.



8.3 Characteristic Functions 
8.3.1 Basic Properties 

In this section we introduce the characteristic function of a random variable. This is intimately
related to the Fourier transform, although this kinship will become apparent only later in this
section.

Definition 8.3.1 (a) If $\mu$ is a probability measure on $(\mathbb{R}, \mathcal{B})$, then the characteristic
function of $\mu$ is the function $\varphi : \mathbb{R} \to \mathbb{C}$ defined by

$$\varphi(t) = \int e^{itx} \, \mu(dx)$$

(b) If $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, then the characteristic
function $\varphi_X : \mathbb{R} \to \mathbb{C}$ of $X$ is defined to be the characteristic function of the distribution
$\mu_X$ of $X$.

Remarks 8.3.2 (a) Clearly identically distributed random variables have identical characteristic functions.
We shall soon prove the converse to this, i.e. that random variables with identical characteristic
functions are also identically distributed.

(b) Suppose that $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. By the Change of Variables Theorem,
we have

$$\varphi_X(t) = \int_{\mathbb{R}} e^{itx} \, d\mu_X = \int_{\Omega} e^{itX} \, d\mathbb{P} = \mathbb{E}[e^{itX}]$$

The characteristic function of a random variable $X$ is therefore frequently defined by

$$\varphi_X(t) = \mathbb{E}[e^{itX}]$$






(c) In a PDE course, the Fourier transform $\mathcal{F}(f)$ of a piecewise continuous real-valued function
$f(x)$ is generally defined by

$$\mathcal{F}(f)(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{itx} f(x) \, dx$$

Thus, if $X$ is a continuous random variable with density $f_X$, then the characteristic function of $X$ is just
the Fourier transform of $f_X$ (apart from a factor of $\sqrt{2\pi}$). To be precise:

$$\varphi_X = \sqrt{2\pi}\, \mathcal{F}(f_X)$$

(d) Of course,

$$\varphi_X(t) = \int \cos(tX) \, d\mathbb{P} + i \int \sin(tX) \, d\mathbb{P} = \mathbb{E}\cos(tX) + i\, \mathbb{E}\sin(tX)$$
□ 



Because the characteristic function of a continuous random variable is just the Fourier
transform of its density function (up to a constant factor) the two have very similar behavioural
properties. Note, however, that the characteristic function of a random variable
$X$ is defined even if $X$ is not continuous: Since $|e^{itX}| = 1$, the random variable $e^{itX} = \cos tX + i \sin tX$
is bounded, and hence integrable.

Proposition 8.3.3 (Basic Properties of Characteristic Functions)
Let $\varphi_X$ be the characteristic function of some random variable $X$.

(a) $\varphi_X(0) = 1$

(b) $|\varphi_X(t)| \le 1$ for all $t \in \mathbb{R}$

(c) $\varphi_X$ is continuous (in fact, uniformly continuous)

(d) $\varphi_{aX+b}(t) = e^{ibt} \varphi_X(at)$

(e) $\varphi_{-X}(t) = \varphi_X(-t) = \overline{\varphi_X(t)}$ (= complex conjugate of $\varphi_X(t)$)

Proof: (a) is obvious.

(b) follows from the fact that $|\int f \, d\mu| \le \int |f| \, d\mu$.

(c) Note that $e^{iuX} \to e^{itX}$ as $u \to t$. The result follows by the Lebesgue Dominated Convergence
Theorem, as the family of $e^{iuX}$ is dominated by an integrable random variable
($|e^{iuX}| = 1$). Uniform continuity is now easy to see.

(d) is straightforward.

(e) follows from the fact that both $\varphi_{-X}(t)$ and $\varphi_X(-t)$ are equal to $\mathbb{E}[e^{-itX}]$. Now

$$\mathbb{E}[e^{-itX}] = \mathbb{E}\cos(tX) - i\,\mathbb{E}\sin(tX) = \overline{\mathbb{E}\cos(tX) + i\,\mathbb{E}\sin(tX)} = \overline{\mathbb{E}[e^{itX}]}$$

⊣

The following trivial result is nevertheless of great importance:

Theorem 8.3.4 If $X, Y$ are independent random variables on some probability space
$(\Omega, \mathcal{F}, \mathbb{P})$, then

$$\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)$$






Exercise 8.3.5 Prove the preceding theorem. 

□ 

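Examples 8.3.6 We give here some examples of characteristic functions of random variables:

(a) Normal distribution: Assume that $X$ is normally distributed with mean 0 and variance 1. Then it has
density function

$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$

Thus

$$\varphi_X(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{x^2}{2}} \cos tx \, dx + \frac{i}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{x^2}{2}} \sin tx \, dx$$

However, $e^{-\frac{x^2}{2}} \sin tx$ is an odd function of $x$, and thus the second integral on the right is 0. It follows
that

$$\varphi_X(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{x^2}{2}} \cos tx \, dx$$

Differentiating with respect to $t$ yields

$$\varphi_X'(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} -x e^{-\frac{x^2}{2}} \sin tx \, dx$$

If we integrate this integral by parts, we obtain

$$\varphi_X'(t) = -\frac{t}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{x^2}{2}} \cos tx \, dx = -t\varphi_X(t)$$

This is a first-order separable differential equation with solution

$$\varphi_X(t) = \varphi_X(0) e^{-\frac{t^2}{2}}$$

Since $\varphi_X(0) = 1$, we thus obtain

$$\varphi_X(t) = e^{-\frac{t^2}{2}}$$

Now if $Y = \sigma X + \mu$, then $Y$ is a normally distributed random variable with mean $\mu$ and variance $\sigma^2$. It
follows that

$$\varphi_Y(t) = \varphi_{\sigma X + \mu}(t) = e^{i\mu t} \varphi_X(\sigma t) = e^{i\mu t - \frac{\sigma^2 t^2}{2}}$$

(b) Suppose that $X$ is a Bernoulli variable, with $\mathbb{P}(X = 1) = \mathbb{P}(X = -1) = \frac{1}{2}$. Then

$$\varphi_X(t) = \mathbb{E}e^{itX} = \frac{e^{it}}{2} + \frac{e^{-it}}{2} = \cos t$$

(c) Suppose that $X$ is Poisson distributed with rate $\lambda$. Then

$$\varphi_X(t) = \sum_{k=0}^{\infty} e^{itk} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^{it})^k}{k!} = e^{\lambda(e^{it} - 1)}$$

(d) If $X$ is uniformly distributed over $[a, b]$, then

$$\varphi_X(t) = \frac{e^{itb} - e^{ita}}{it(b - a)}$$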

□ 
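
The closed forms in Examples 8.3.6 can be sanity-checked by computing $\mathbb{E}[e^{itX}]$ directly from the definition. The following Python sketch does this with a truncated series for the Poisson case and a midpoint rule for the uniform and normal cases; the parameter values, truncation levels and function names are assumptions made only for this illustration.

```python
import cmath
import math

def cf_poisson_series(t, lam, terms=200):
    """Sum of e^{itk} P(X = k) over k < terms, for X ~ Poisson(lam)."""
    total = 0j
    weight = math.exp(-lam)          # P(X = 0)
    for k in range(terms):
        total += cmath.exp(1j * t * k) * weight
        weight *= lam / (k + 1)      # P(X = k + 1) obtained from P(X = k)
    return total

def cf_by_midpoint(t, density, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of e^{itx} * density(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0j
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += cmath.exp(1j * t * x) * density(x)
    return total * h

t, lam, a, b = 1.3, 2.5, -1.0, 2.0

# (a) standard normal, integrated over [-12, 12] where the neglected tails are negligible
normal_numeric = cf_by_midpoint(
    t, lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi), -12.0, 12.0)
normal_closed = math.exp(-0.5 * t * t)

# (b) symmetric Bernoulli on {-1, +1}
bernoulli_numeric = 0.5 * cmath.exp(1j * t) + 0.5 * cmath.exp(-1j * t)
bernoulli_closed = math.cos(t)

# (c) Poisson with rate lam
poisson_numeric = cf_poisson_series(t, lam)
poisson_closed = cmath.exp(lam * (cmath.exp(1j * t) - 1.0))

# (d) uniform on [a, b]
uniform_numeric = cf_by_midpoint(t, lambda x: 1.0 / (b - a), a, b)
uniform_closed = (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / (1j * t * (b - a))

print(abs(normal_numeric - normal_closed))
print(abs(bernoulli_numeric - bernoulli_closed))
print(abs(poisson_numeric - poisson_closed))
print(abs(uniform_numeric - uniform_closed))
```

All four printed discrepancies should be tiny; they reflect only quadrature, truncation and rounding error.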






Characteristic functions and moments are related in the following way:

Proposition 8.3.7 Suppose that a random variable $X$ has an $n$-th moment, i.e. that

$$\mathbb{E}|X|^n < +\infty$$

Then its characteristic function $\varphi_X(t)$ has a continuous derivative of order $n$, and

$$\varphi_X^{(n)}(t) = \mathbb{E}[(iX)^n e^{itX}]$$

Proof: Note that $\mathbb{E}[X^n e^{itX}]$ exists if and only if $\mathbb{E}|X|^n < +\infty$, i.e. if and only if $X$ has an
$n$-th moment.

We tackle first the case $n = 1$. Fix $t \in \mathbb{R}$, and suppose that $\mathbb{E}X$ exists. Then

$$\frac{\varphi_X(t + h) - \varphi_X(t)}{h} = \int e^{itx}\, \frac{e^{ihx} - 1}{h} \, \mu_X(dx)$$

But

$$\left| e^{itx}\, \frac{e^{ihx} - 1}{h} \right| \le |x|$$

(You may wish to consult the "circle inequality" proved later in this chapter.) Thus we may
apply the Lebesgue Dominated Convergence Theorem to deduce

$$\lim_{h \to 0} \frac{\varphi_X(t + h) - \varphi_X(t)}{h} = \int e^{itx} \lim_{h \to 0} \frac{e^{ihx} - 1}{h} \, \mu_X(dx) = \int ix e^{itx} \, \mu_X(dx)$$

which proves

$$\varphi_X'(t) = i\,\mathbb{E}[X e^{itX}]$$

This yields the result for $n = 1$. Repeating the argument will establish the result for higher
$n$.

⊣

It follows that one may calculate the moments of a random variable directly from its
characteristic function. Putting $t = 0$ in the above yields:

Proposition 8.3.8 If $X$ is a random variable with characteristic function $\varphi_X(t)$, then

$$\mathbb{E}X^n = i^{-n} \varphi_X^{(n)}(0)$$
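
As a hedged numerical illustration of the moment formula $\mathbb{E}X^n = i^{-n}\varphi_X^{(n)}(0)$, the Python sketch below approximates the first two derivatives of a characteristic function at 0 by central finite differences. The step size `h`, the test distributions and the function names are choices made for this example only; finite differences are a convenient stand-in for exact differentiation, not part of the text's argument.

```python
import cmath
import math

def first_two_moments(phi, h=1e-3):
    """Approximate E[X] and E[X^2] from central differences of phi at t = 0."""
    d1 = (phi(h) - phi(-h)) / (2.0 * h)                  # approximates phi'(0)
    d2 = (phi(h) - 2.0 * phi(0.0) + phi(-h)) / (h * h)   # approximates phi''(0)
    m1 = (d1 / 1j).real          # i^{-1} phi'(0)
    m2 = (d2 / (1j ** 2)).real   # i^{-2} phi''(0)
    return m1, m2

lam = 2.5

def phi_normal(t):
    """Characteristic function of N(0, 1)."""
    return cmath.exp(-0.5 * t * t)

def phi_poisson(t):
    """Characteristic function of Poisson(lam)."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1.0))

normal_m1, normal_m2 = first_two_moments(phi_normal)
poisson_m1, poisson_m2 = first_two_moments(phi_poisson)

print(normal_m1, normal_m2)      # expected to be near 0 and 1
print(poisson_m1, poisson_m2)    # expected to be near lam and lam + lam**2
```

The comparison values are the standard moments of these laws: mean 0 and second moment 1 for $N(0,1)$, mean $\lambda$ and second moment $\lambda + \lambda^2$ for the Poisson distribution.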



8.3.2 Inversion 

The main result about characteristic functions is that random variables whose characteristic
functions are equal are identically distributed, i.e. $\varphi_X = \varphi_Y$ if and only if $\mu_X = \mu_Y$. The
distribution of a random variable can be recovered from its characteristic function. This result
follows from the following theorem:






Theorem 8.3.9 (Lévy's Inversion Formula)

Let $\varphi$ be the characteristic function of a random variable $X$, which has distribution $\mu$ and
distribution function $F$. If $a < b \in \mathbb{R}$, then

$$\lim_{T \to +\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \, dt = \frac{1}{2}\mu(\{a\}) + \mu(a, b) + \frac{1}{2}\mu(\{b\})$$
$$= \frac{1}{2}[F(b) + F(b-)] - \frac{1}{2}[F(a) + F(a-)]$$

Remarks 8.3.10 Note that if $F$ is continuous at $a$ and $b$, then $\mu(\{a\}) = 0 = \mu(\{b\})$. In that case we have

$$\lim_{T \to +\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \, dt = F(b) - F(a)$$



□ 
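
The inversion formula can be tried out numerically. The Python sketch below approximates the truncated integral with a midpoint rule for two characteristic functions from Examples 8.3.6: the standard normal (no atoms) and the symmetric Bernoulli law on $\{-1, +1\}$ with $a = 0$, $b = 1$, where the atom at $b$ should contribute only half of its mass. The values of $T$, the step counts and the function names are assumptions for this illustration.

```python
import cmath
import math

def inversion_integral(phi, a, b, T, steps=200000):
    """Midpoint approximation of (1/2pi) * int_{-T}^{T} (e^{-ita} - e^{-itb})/(it) * phi(t) dt.

    steps is even, so no midpoint coincides with the removable singularity at t = 0.
    """
    h = 2.0 * T / steps
    total = 0j
    for j in range(steps):
        t = -T + (j + 0.5) * h
        kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += kernel * phi(t)
    return (total * h / (2.0 * math.pi)).real

def phi_normal(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-0.5 * t * t)

def phi_coin(t):
    """Characteristic function of P(X = 1) = P(X = -1) = 1/2."""
    return math.cos(t)

# No atoms: the limit is F(1) - F(-1) for the standard normal.
normal_value = inversion_integral(phi_normal, -1.0, 1.0, T=12.0)
normal_exact = math.erf(1.0 / math.sqrt(2.0))

# Atom of mass 1/2 at b = 1, nothing at a = 0 or in (0, 1): the limit is 1/4.
coin_value = inversion_integral(phi_coin, 0.0, 1.0, T=200.0)
coin_exact = 0.25

print(normal_value, normal_exact)
print(coin_value, coin_exact)
```

For the normal law the truncation at $T = 12$ is harmless because $\varphi$ decays rapidly. For the coin, $\varphi$ does not decay, and the truncated integral approaches $1/4$ only at rate of order $1/T$.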



It follows from Lévy's Inversion Formula that the distribution of a random variable $X$ is
determined by its characteristic function. For suppose that $X, Y$ are random variables which
have the same characteristic function $\varphi$. Let $\mu_X$ be the distribution of $X$, and let $\mu_Y$ be the
distribution of $Y$. Then for any real numbers $a < b$, we must have

$$\frac{1}{2}\mu_X(\{a\}) + \mu_X(a, b) + \frac{1}{2}\mu_X(\{b\}) = \frac{1}{2}\mu_Y(\{a\}) + \mu_Y(a, b) + \frac{1}{2}\mu_Y(\{b\})$$

There are at most countably many real places where a distribution function can be discontinuous.
It therefore follows that there are sequences $a_n \downarrow a$ and $b_n \uparrow b$ such that both $F_X, F_Y$
are continuous at each $a_n$ and $b_n$. It follows by the displayed identity that $\mu_X(a_n, b_n) = \mu_Y(a_n, b_n)$ for each $n$.
Now since $(a_n, b_n) \uparrow (a, b)$, by continuity of measure we have

$$\mu_X(a, b) = \mu_Y(a, b)$$

Thus $\mu_X, \mu_Y$ are finite measures which agree on all open intervals. But the open intervals
form a $\pi$-system which generates the Borel algebra, and thus $\mu_X$ must agree with $\mu_Y$ on
every Borel set. We have shown:

Theorem 8.3.11 Two random variables have the same characteristic functions if and
only if they have the same distribution.

Examples 8.3.12 (a) If a random variable $X$ has a characteristic function $\varphi_X(t) = e^{-\frac{t^2}{2}}$, then $X$ must
be normally distributed with mean 0 and variance 1. This follows from Examples 8.3.6(a) and the Lévy
Inversion Formula.

(b) Suppose that $X, Y$ are independent normally distributed random variables with mean 0 and variance 1.
Then

$$\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t) = e^{-t^2}$$

Thus $X + Y$ has the same characteristic function as a normally distributed random variable with mean 0
and variance 2. Hence $X + Y$ is a normally distributed random variable with mean 0 and variance 2.






(c) If $\varphi_X(t)$ is real-valued (as opposed to complex-valued), then

$$\varphi_{-X}(t) = \overline{\varphi_X(t)} = \varphi_X(t)$$

Thus $\varphi_X$ is real if and only if $X$ and $-X$ are identically distributed, i.e. if and only if $X$ is distributed
symmetrically about the origin. As a consequence, all odd moments must be zero (if they exist).

□ 

Before we prove that Levy's Inversion Formula holds, we need a definition and some 
lemmas: 



Definition 8.3.13 Define the function $S(T)$ for $T > 0$ by

$$S(T) = \int_0^T \frac{\sin x}{x} \, dx$$

Also define the function $\operatorname{sgn}(x)$ by

$$\operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

The function $x^{-1} \sin x$ itself is not Lebesgue integrable over the reals, because its positive
and negative parts both integrate to infinity (exercise!). Nevertheless, $\lim_{T \to \infty} \int_0^T x^{-1} \sin x \, dx$
exists, because the sequence

$$\int_{(n-1)\pi}^{n\pi} x^{-1} \sin x \, dx$$

is alternating and decreases in absolute value to 0 as $n \to +\infty$. It is important to know what this limit is:

Lemma 8.3.14

$$\lim_{T \to +\infty} \int_0^T \frac{\sin x}{x} \, dx = \frac{\pi}{2}$$



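Proof: Note that

$$\int_0^T e^{-ux} \sin x \, dx = \frac{1}{1 + u^2}\left[1 - e^{-uT}(u \sin T + \cos T)\right]$$

which may be obtained by two successive applications of integration by parts. Now the
function $e^{-ux} \sin x$ is integrable over $(0, T) \times (0, +\infty)$, because the integral of its modulus is:

$$\int_0^T \int_0^\infty e^{-ux} |\sin x| \, du \, dx = \int_0^T x^{-1} |\sin x| \, dx \le T < +\infty$$

using $|x^{-1} \sin x| \le 1$. We may therefore apply Fubini's theorem:

$$\int_0^T \frac{\sin x}{x} \, dx = \int_0^T \sin x \left[\int_0^\infty e^{-ux} \, du\right] dx = \int_0^\infty \left[\int_0^T e^{-ux} \sin x \, dx\right] du$$
$$= \int_0^\infty \frac{du}{1 + u^2} - \int_0^\infty \frac{e^{-uT}}{1 + u^2}(u \sin T + \cos T) \, du$$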




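Clearly $\int_0^\infty \frac{du}{1 + u^2} = \frac{\pi}{2}$. Furthermore

$$\left| \int_0^\infty \frac{e^{-uT}}{1 + u^2}(u \sin T + \cos T) \, du \right| \le \int_0^\infty \frac{e^{-uT}}{1 + u^2} |u \sin T + \cos T| \, du \le \int_0^\infty e^{-uT}(u + 1) \, du \to 0 \quad\text{as } T \to \infty$$

by the Lebesgue Dominated Convergence Theorem, so that $\int_0^\infty \frac{e^{-uT}}{1 + u^2}(u \sin T + \cos T) \, du \to 0$
as $T \to +\infty$. This is what we needed to establish.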
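
The limit in Lemma 8.3.14 can be observed numerically. The Python sketch below approximates $S(T) = \int_0^T x^{-1}\sin x \, dx$ by a midpoint rule (the step count and the function names are assumptions for this illustration) and compares the error with the explicit bound $\int_0^\infty e^{-uT}(u+1)\,du = T^{-1} + T^{-2}$ that appears in the proof.

```python
import math

def S(T, steps=200000):
    """Midpoint-rule approximation of the integral of sin(x)/x over [0, T]."""
    h = T / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h      # midpoints avoid the removable singularity at x = 0
        total += math.sin(x) / x
    return total * h

def proof_bound(T):
    """Integral of e^{-uT}(u + 1) over (0, infinity), evaluated in closed form."""
    return 1.0 / T + 1.0 / (T * T)

for T in (10.0, 100.0, 1000.0):
    error = abs(S(T) - math.pi / 2.0)
    print(T, S(T), error, proof_bound(T))
```

The error decays roughly like $1/T$ and stays below the bound, so $S(T)$ approaches $\pi/2$ slowly and in an oscillating fashion.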



Another technical result that we shall need is the following:

Lemma 8.3.15

$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{itx}}{it} \, dt = \frac{\operatorname{sgn}(x) S(|x|T)}{\pi}$$

Proof:

$$\frac{1}{2\pi} \int_{-T}^{T} \frac{\cos tx + i \sin tx}{it} \, dt = \frac{1}{2\pi} \int_{-T}^{T} \frac{\sin tx}{t} \, dt$$

because $\frac{\cos tx}{it}$ is an odd function of $t$. Moreover, using the fact that $\frac{\sin tx}{t}$ is even and a
change of variables $u = |x|t$, we obtain

$$\int_{-T}^{T} \frac{\sin tx}{t} \, dt = 2\operatorname{sgn}(x) \int_0^{|x|T} \frac{\sin u}{u} \, du$$

from which the result now follows trivially.

⊣

The final lemma will be useful in giving upper bounds for certain functions:

Lemma 8.3.16 ("circle inequality")
For $u, v \in \mathbb{R}$ with $u \le v$ we have

$$|e^{iv} - e^{iu}| \le |v - u|$$

Proof: This follows from the fact that

$$|e^{iv} - e^{iu}| = \left| \int_u^v i e^{it} \, dt \right| \le \int_u^v |i e^{it}| \, dt = v - u$$

(or more simply from a diagram: Just draw $e^{iv}$ and $e^{iu}$ on the unit circle in the complex plane
and think about it.)






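We may now prove the Lévy Inversion Formula:

Proof of Lévy's Inversion Formula:

If $a < b$ in $\mathbb{R}$, and if $0 < T < +\infty$, then

$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \, dt = \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \left[\int_{\mathbb{R}} e^{itx} \, \mu(dx)\right] dt$$
$$= \int_{\mathbb{R}} \left[\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt\right] \mu(dx)$$

by Fubini's Theorem, since the integral of the absolute value is finite: By Tonelli's Theorem,

$$\frac{1}{2\pi} \int_{[-T,T] \times \mathbb{R}} \left| \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \right| (\lambda \otimes \mu)(dt, dx) = \frac{1}{2\pi} \int_{\mathbb{R}} \left( \int_{-T}^{T} \left| \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \right| dt \right) \mu(dx)$$
$$\le \frac{1}{2\pi} \int_{\mathbb{R}} \left( \int_{-T}^{T} |b - a| \, dt \right) \mu(dx) = \frac{(b - a)T}{\pi}$$

using the inequality on the unit circle above.
Using Lemma 8.3.15, we obtain

$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt = \frac{\operatorname{sgn}(x - a) S(|x - a|T) - \operatorname{sgn}(x - b) S(|x - b|T)}{\pi}$$

Now as $U \uparrow +\infty$, $S(U) \to \frac{\pi}{2}$, and thus

$$\lim_{T \to \infty} \frac{\operatorname{sgn}(x - a) S(|x - a|T) - \operatorname{sgn}(x - b) S(|x - b|T)}{\pi} = \begin{cases} 0 & \text{if } x < a \\ \frac{1}{2} & \text{if } x = a \\ 1 & \text{if } a < x < b \\ \frac{1}{2} & \text{if } x = b \\ 0 & \text{if } b < x \end{cases}$$

Hence

$$\lim_{T \to \infty} \frac{\operatorname{sgn}(x - a) S(|x - a|T) - \operatorname{sgn}(x - b) S(|x - b|T)}{\pi} = \frac{1}{2} I_{\{a\}}(x) + I_{(a,b)}(x) + \frac{1}{2} I_{\{b\}}(x)$$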





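It follows that

$$\lim_{T \to +\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \, dt$$
$$= \lim_{T \to +\infty} \int \frac{\operatorname{sgn}(x - a) S(|x - a|T) - \operatorname{sgn}(x - b) S(|x - b|T)}{\pi} \, \mu(dx)$$
$$= \int \lim_{T \to +\infty} \frac{\operatorname{sgn}(x - a) S(|x - a|T) - \operatorname{sgn}(x - b) S(|x - b|T)}{\pi} \, \mu(dx)$$
$$= \frac{1}{2}\mu(\{a\}) + \mu(a, b) + \frac{1}{2}\mu(\{b\})$$

where we used the Lebesgue Dominated Convergence Theorem to take the limit inside the
integral (the integrands are uniformly bounded, because $S$ is a bounded function). This proves the result.
Note also that

$$\frac{1}{2}[F(b) + F(b-) - F(a) - F(a-)] = \frac{1}{2}[\mu(a, b] + \mu[a, b)]$$

⊣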

An extremely useful (but slightly weaker) version of the inversion theorem is the following: 



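Theorem 8.3.17 Let $\varphi$ be the characteristic function of a random variable $X$, which has
distribution $\mu$ and distribution function $F$. If

$$\int_{\mathbb{R}} |\varphi(t)| \, dt < +\infty$$

then $X$ has a continuous probability density function $f_X$, and

$$f_X(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-itx} \varphi(t) \, dt$$

Exercise 8.3.18 We prove Thm. 8.3.17.

(a) Show that $\int_{\mathbb{R}} \left| \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \right| dt \le |b - a| \int_{\mathbb{R}} |\varphi(t)| \, dt$ and conclude that $\frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t)$ is integrable over
$\mathbb{R}$ (with respect to $t$).

(b) Explain why

$$F(b) - F(a) \le \frac{|b - a|}{2\pi} \int_{\mathbb{R}} |\varphi(t)| \, dt$$

and why this, together with the fact that $\int_{\mathbb{R}} |\varphi(t)| \, dt < +\infty$, immediately implies that $F$ is continuous.

(c) Explain why

$$F(b) - F(a) = \frac{1}{2\pi} \int_{\mathbb{R}} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi(t) \, dt$$

for all $a < b \in \mathbb{R}$.

(d) Deduce that for all $x$ and all $h \ne 0$, we have

$$\frac{F(x + h) - F(x)}{h} = \frac{1}{2\pi} \int_{\mathbb{R}} \frac{e^{-itx} - e^{-it(x+h)}}{ith}\, \varphi(t) \, dt$$

(e) Use the Lebesgue Dominated Convergence Theorem to conclude that

$$f_X(x) = F'(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-itx} \varphi(t) \, dt$$

as required.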






(f) Finally, explain why fx must be a continuous function. 

□ 

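Remarks 8.3.19 If $X$ is a random variable with a continuous probability density function $f_X$ and an integrable
characteristic function $\varphi_X$, then we have the following "duality":

$$\varphi_X(t) = \int_{\mathbb{R}} f_X(x) e^{itx} \, dx$$
$$f_X(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \varphi_X(t) e^{-itx} \, dt$$

Thus the density and characteristic functions are inverses of each other under Fourier transforms.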

□ 
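
The density formula $f_X(x) = \frac{1}{2\pi}\int \varphi_X(t) e^{-itx}\,dt$ can be tested on two integrable characteristic functions with known densities: $e^{-t^2/2}$ (standard normal) and $1/(1+t^2)$, which is the characteristic function of the Laplace density $\frac{1}{2}e^{-|x|}$ (a standard fact, not derived in the text). The Python sketch below truncates the integral to $[-T, T]$ and uses a midpoint rule; the truncation levels, step counts and function names are assumptions for this illustration.

```python
import cmath
import math

def density_from_cf(phi, x, T, steps):
    """Midpoint approximation of (1/2pi) * int_{-T}^{T} e^{-itx} phi(t) dt."""
    h = 2.0 * T / steps
    total = 0j
    for j in range(steps):
        t = -T + (j + 0.5) * h
        total += cmath.exp(-1j * t * x) * phi(t)
    return (total * h / (2.0 * math.pi)).real

def phi_normal(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-0.5 * t * t)

def phi_laplace(t):
    """Characteristic function of the density exp(-|x|) / 2."""
    return 1.0 / (1.0 + t * t)

x = 0.5
normal_numeric = density_from_cf(phi_normal, x, T=12.0, steps=20000)
normal_exact = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

y = 1.0
laplace_numeric = density_from_cf(phi_laplace, y, T=500.0, steps=200000)
laplace_exact = 0.5 * math.exp(-abs(y))

print(normal_numeric, normal_exact)
print(laplace_numeric, laplace_exact)
```

The Laplace case converges more slowly because its characteristic function decays only like $t^{-2}$, so a larger truncation level is used and a looser agreement should be expected.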

8.3.3 Weak Convergence and Characteristic Functions 

Our next result proves that weak convergence of measures is equivalent to pointwise conver- 
gence of characteristic functions: 

Theorem 8.3.20 (Lévy Continuity Theorem)

Let $\mu_n$ be probability measures on $(\mathbb{R}, \mathcal{B})$, and let $\varphi_n$ be the associated characteristic functions.
Suppose that $(\varphi_n(t))_n$ converges for every $t \in \mathbb{R}$, and that

$$\varphi(t) := \lim_n \varphi_n(t)$$

If $\varphi$ is continuous at $t = 0$, then $\varphi$ is the characteristic function of a probability distribution
$\mu$, and $\mu_n \xrightarrow{w} \mu$.

Proof: Suppose first that $(\mu_n)_n$ is tight. Then by the Helly-Bray Lemma there is a subsequence
$(\mu_{n_k})_k$ and a probability distribution $\mu$ such that $\mu_{n_k} \xrightarrow{w} \mu$. Now $e^{itx} = \cos tx + i \sin tx$
is bounded and continuous, and hence, by definition of weak convergence, we have $\int e^{itx} \mu_{n_k}(dx) \to \int e^{itx} \mu(dx)$,
i.e. $\varphi_{n_k}(t) \to \psi(t)$, where $\psi$ is the characteristic function of $\mu$. But as
$\varphi_n(t) \to \varphi(t)$, the same is true for any subsequence. Hence $\varphi(t) = \psi(t)$.

So far, we know only that $\mu_{n_k} \xrightarrow{w} \mu$, but we would like to show that $\mu_n \xrightarrow{w} \mu$. It is easier
to argue with distribution functions: If $\mu_n$ does not converge weakly to $\mu$, then $F_n$ does not converge weakly to $F$
(where, of course, $F_n, F$ are the distribution functions of $\mu_n, \mu$), which means that there is
a continuity point $x$ of $F$ such that $F_n(x) \not\to F(x)$. It follows that there is an $\varepsilon > 0$ and a
subsequence $(F_{m_j})_j$ such that $|F_{m_j}(x) - F(x)| \ge \varepsilon$ for all $j$. Applying the Helly-Bray Lemma to
this subsequence, we obtain a subsubsequence $(F_{m_{j_l}})_l$ which converges weakly, by tightness,
to some distribution function $G$, with characteristic function $\varphi_G$. But since $\varphi_n(t) \to \varphi(t)$ we
must have $\varphi_{m_{j_l}}(t) \to \varphi(t)$, and hence $\varphi_G(t) = \varphi(t) = \psi(t)$. Thus $G$ and $F$ have the same
characteristic function, and hence $G = F$, by Lévy's Inversion Theorem. But this leads to a
contradiction: As $F_{m_{j_l}}(x) \to G(x)$ and $G(x) = F(x)$, we have $F_{m_{j_l}}(x) \to F(x)$ as $l \to \infty$. Yet
$|F_{m_j}(x) - F(x)| \ge \varepsilon$ for all $j \in \mathbb{N}$, and thus we have both

$$|F_{m_{j_l}}(x) - F(x)| \ge \varepsilon \text{ for all } l \quad\text{and}\quad |F_{m_{j_l}}(x) - G(x)| < \varepsilon \text{ for all large } l$$






which is impossible. 

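It just remains to show that the $(\mu_n)_n$ are tight: Now if $\delta > 0$, then

$$\delta^{-1} \int_{-\delta}^{\delta} 1 - \varphi_n(t)\, dt = \delta^{-1} \int_{-\delta}^{\delta} \int_{\mathbb{R}} 1 - e^{itx}\, \mu_n(dx)\, dt$$
$$= \int_{\mathbb{R}} \delta^{-1} \int_{-\delta}^{\delta} 1 - e^{itx}\, dt\, \mu_n(dx)$$
$$\ge \mu_n(\{x : |x| \ge 2\delta^{-1}\})$$

Here we used Fubini's Theorem to change the order of integration (which is permitted, since $1 - e^{itx}$ is bounded by 2). The inner integral equals $2\left(1 - \frac{\sin \delta x}{\delta x}\right)$, which is non-negative for every $x$, so we may restrict the outer integral to the set $\{x : |x| \ge 2\delta^{-1}\}$. We also used the fact that $1 - \frac{\sin \delta x}{\delta x} \ge 1 - \frac{1}{|\delta x|}$ and that $1 - \frac{1}{|\delta x|} \ge \frac{1}{2}$ when $|x| \ge 2\delta^{-1}$.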

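Now clearly $\varphi(0) = \lim_n \varphi_n(0) = \lim_n 1 = 1$. As $\varphi$ is assumed to be continuous at $t = 0$, there is for every $\varepsilon > 0$ a $\delta > 0$ such that

$$\delta^{-1} \int_{-\delta}^{\delta} 1 - \varphi(t)\, dt < \varepsilon$$

Since $\varphi_n(t) \to \varphi(t)$, the Lebesgue Dominated Convergence Theorem implies that there exists an $N$ such that

$$\delta^{-1} \int_{-\delta}^{\delta} 1 - \varphi_n(t)\, dt < \varepsilon \quad\text{for all } n \ge N$$

It follows that

$$\mu_n[-2\delta^{-1}, 2\delta^{-1}] \ge 1 - \varepsilon$$

for all $n \ge N$. Also pick $K_1, K_2, \dots, K_N > 0$ such that

$$\mu_n[-K_n, K_n] \ge 1 - \varepsilon \quad\text{for } n = 1, \dots, N$$

If $K := \max\{2\delta^{-1}, K_1, K_2, \dots, K_N\}$, then

$$\mu_n[-K, K] \ge 1 - \varepsilon \quad\text{for all } n \in \mathbb{N}$$

proving that $(\mu_n)_n$ is indeed tight.

⊣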

Now if $\mu_n, \mu$ are probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with characteristic functions $\varphi_n, \varphi$, and $\mu_n \Rightarrow \mu$, then we must certainly have $\int e^{itx}\, \mu_n(dx) \to \int e^{itx}\, \mu(dx)$, by definition of weak convergence, as $e^{itx} = \cos tx + i \sin tx$ is bounded and continuous. Thus

$$\mu_n \Rightarrow \mu \quad\text{implies}\quad \varphi_n(t) \to \varphi(t) \text{ for all } t \in \mathbb{R}$$

Conversely, if $\varphi_n(t) \to \varphi(t)$ for all $t \in \mathbb{R}$, then, by the previous theorem, $\mu_n \Rightarrow \mu$, as $\varphi$, being a characteristic function, is continuous, and thus continuous at $t = 0$. We thus have:






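Corollary 8.3.21 Suppose that $\mu_n, \mu$ are probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with characteristic functions $\varphi_n, \varphi$. Then

$$\mu_n \Rightarrow \mu \quad\text{if and only if}\quad \varphi_n(t) \to \varphi(t) \text{ for all } t \in \mathbb{R}.$$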



In the proof of the Levy Continuity Theorem, we proved the following useful inequality: 

Proposition 8.3.22 If $\varphi$ is the characteristic function of a probability measure $\mu$ on $\mathbb{R}$, and if $\delta > 0$, then

$$\delta^{-1} \int_{-\delta}^{\delta} 1 - \varphi(t)\, dt \ge \mu(\{x : |x| \ge 2\delta^{-1}\})$$



□ 
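
The inequality of Proposition 8.3.22 can be checked numerically in a simple case. The following is only a sketch, assuming that $\mu$ is the standard normal distribution, whose characteristic function is $e^{-t^2/2}$; the function names `lhs` and `rhs` and the midpoint-rule integration are illustrative choices, not part of the notes.

```python
import math

def char_fn_std_normal(t):
    """Characteristic function of N(0, 1): exp(-t^2 / 2), which is real-valued."""
    return math.exp(-t * t / 2.0)

def lhs(delta, steps=20_000):
    """Midpoint-rule approximation of (1/delta) * integral over [-delta, delta] of 1 - phi(t)."""
    h = 2.0 * delta / steps
    total = 0.0
    for i in range(steps):
        t = -delta + (i + 0.5) * h
        total += 1.0 - char_fn_std_normal(t)
    return total * h / delta

def rhs(delta):
    """mu({x : |x| >= 2/delta}) for the standard normal, via the complementary error function."""
    return math.erfc((2.0 / delta) / math.sqrt(2.0))

for delta in (0.5, 1.0, 2.0, 4.0):
    print(f"delta={delta}: lhs={lhs(delta):.6f}  rhs={rhs(delta):.6f}")
```

For each listed value of `delta` the left-hand side should dominate the tail probability on the right. This is a sanity check for one distribution, not a proof.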



.1 The Central Limit Theorem 

The Central Limit Theorem is one of the fundamental results in mathematics, and is easy to prove with the machinery set up in the previous two sections. It states that if $X_n$ is a sequence of independent identically distributed random variables with mean $\mu$ and variance $\sigma^2$, then the distribution of the fractions $(X_1 + \dots + X_n - n\mu)/\sigma\sqrt{n}$ tends to a standard normal distribution (with mean 0 and variance 1). Why the $\sqrt{n}$? Basically, if we define $S_n = X_1 + \dots + X_n$, then it is clear that $S_n$ has mean $n\mu$ and variance $n\sigma^2$. Thus as $n \to +\infty$, $\mathrm{var}(S_n) \to +\infty$ as well. However, $S_n/\sqrt{n}$ has variance $\mathrm{var}(S_n)/n = \sigma^2$.

We can conclude that if $n$ is sufficiently large, then $S_n$ is approximately normally distributed with mean $n\mu$ and variance $n\sigma^2$. Thus the sums of independent identically distributed random variables tend to become normally distributed.

Here is a concrete exercise which contains all the important elements of the proof of the 
Central Limit Theorem: 

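Exercise .1.1 (a) Suppose that $X$ is a random variable with

$$\mathbb{P}(X = 1) = \tfrac{1}{2} = \mathbb{P}(X = -1)$$

and that $\varphi_X$ is the characteristic function of $X$. Show that

$$\varphi_X(u) = \cos u$$

(b) Now let $X_n$ (for $n \in \mathbb{N}$) be independent random variables, all with the same distribution as $X$ in (a). For $n \in \mathbb{N}$, define random variables $G_n$ by

$$G_n := \frac{X_1 + X_2 + \dots + X_n}{\sqrt{n}}$$

Show that the characteristic function $\varphi_{G_n}$ of $G_n$ is given by

$$\varphi_{G_n}(u) = \left(\cos \frac{u}{\sqrt{n}}\right)^n$$

(c) Use a Taylor expansion of $\cos x$ about $x = 0$ to show that

$$\cos \frac{u}{\sqrt{n}} = 1 + \frac{u^2}{n}\left(-\frac{1}{2} + \varepsilon\left(\frac{u}{\sqrt{n}}\right)\right) \quad\text{where } \varepsilon(h) \to 0 \text{ as } h \to 0$$

(d) Now define $k := k\left(\frac{u}{\sqrt{n}}\right) := -\frac{1}{2} + \varepsilon\left(\frac{u}{\sqrt{n}}\right)$ so that

$$k \to -\frac{1}{2} \text{ as } n \to \infty \quad\text{and}\quad \cos \frac{u}{\sqrt{n}} = 1 + \frac{u^2 k}{n}$$

By considering $\ln \varphi_{G_n}(u)$, and with the aid of a Taylor expansion of $\ln(1 + x)$ about $x = 0$, show that

$$\lim_{n \to \infty} \varphi_{G_n}(u) = e^{-\frac{1}{2}u^2}$$

(e) Now explain why we may deduce that $G_n \Rightarrow Z$, where $Z$ is a standard normal random variable.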

□ 
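
The limit in part (d) of the exercise can be observed numerically. The sketch below assumes the formula from part (b), $\varphi_{G_n}(u) = (\cos(u/\sqrt{n}))^n$, and compares it with $e^{-u^2/2}$; the names `phi_G`, `phi_normal` and `error` are illustrative only.

```python
import math

def phi_G(n, u):
    """Characteristic function of G_n = (X_1 + ... + X_n) / sqrt(n) for fair +1/-1 steps."""
    return math.cos(u / math.sqrt(n)) ** n

def phi_normal(u):
    """Characteristic function of the standard normal distribution."""
    return math.exp(-u * u / 2.0)

def error(n, u):
    """Absolute gap between phi_G(n, u) and its claimed limit."""
    return abs(phi_G(n, u) - phi_normal(u))

u = 1.5
for n in (10, 100, 10_000):
    print(n, phi_G(n, u), error(n, u))
```

The gap should shrink as `n` grows, which illustrates the convergence but does not replace the Taylor-expansion argument the exercise asks for.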



Theorem .1.2 (Central Limit Theorem)

Let $N$ be a standard normally distributed random variable, with mean 0 and variance 1. Let $X_n$ be a sequence of independent identically distributed random variables with mean $\mu$ and variance $0 < \sigma^2 < +\infty$. Define $S_n = X_1 + X_2 + \dots + X_n$ and set

$$G_n := \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

Then

$$G_n \Rightarrow N$$



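Proof: It clearly suffices to prove the theorem for independent identically distributed $X_n$ with mean zero and variance 1, because if the $X_n$ have mean $\mu$ and variance $\sigma^2$, then $\tilde{X}_n := (X_n - \mu)/\sigma$ have mean zero and variance 1, and then

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\tilde{X}_1 + \tilde{X}_2 + \dots + \tilde{X}_n}{\sqrt{n}}$$

So let $(X_n)_n$ be a sequence of independent identically distributed random variables with mean zero and variance 1. Define

$$G_n := \frac{X_1 + X_2 + \dots + X_n}{\sqrt{n}}$$

Then because the $X_k$ are independent, we can say that

$$\varphi_{G_n}(t) = \prod_{k=1}^{n} \varphi_{X_k}\left(\frac{t}{\sqrt{n}}\right) = \left(\varphi\left(\frac{t}{\sqrt{n}}\right)\right)^n$$

where $\varphi$ is the common characteristic function of the $X_k$. Now if we consider a Taylor expansion for $\varphi$ about $t = 0$, we'll obtain something like

$$\varphi(t) = \varphi(0) + \varphi'(0)t + \tfrac{1}{2}\varphi''(0)t^2 + \varepsilon(t)t^2$$

where $\varepsilon(t) \to 0$ as $t \to 0$. Taylor's Theorem applies because $X$ has a second moment, so that $\varphi$ is twice differentiable. Now examining the moments, we obtain

$$\varphi(0) = 1 \qquad \varphi'(0) = 0 \qquad \varphi''(0) = -1$$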






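Putting this back into the Taylor expansion for $\varphi$ yields

$$\varphi(t) = 1 - \tfrac{1}{2}t^2 + \varepsilon(t)t^2 = 1 + \left[-\tfrac{1}{2} + \varepsilon(t)\right]t^2 = 1 + k(t)t^2$$

where $k(t) \to -\tfrac{1}{2}$ as $t \to 0$.
Thus

$$\varphi_{G_n}(t) = \left(\varphi\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 + \frac{k\left(\frac{t}{\sqrt{n}}\right)t^2}{n}\right)^n$$

Now recall that the first order Taylor expansion of $\ln(1 + z)$ about $z = 0$ yields

$$\ln(1 + z) = z + e(z)|z| \quad\text{where } e(z) \to 0 \text{ as } z \to 0$$

Put $z := \frac{k\left(\frac{t}{\sqrt{n}}\right)t^2}{n}$, so that $z \to 0$ as $n \to \infty$, to obtain

$$\ln \varphi_{G_n}(t) = n \ln\left(1 + \frac{k\left(\frac{t}{\sqrt{n}}\right)t^2}{n}\right)$$
$$= n \ln(1 + z)$$
$$= n\left[z + e(z)|z|\right]$$
$$= k\left(\tfrac{t}{\sqrt{n}}\right)t^2 + e(z)\left|k\left(\tfrac{t}{\sqrt{n}}\right)\right|t^2$$

and hence

$$\lim_n \ln \varphi_{G_n}(t) = \lim_n k\left(\tfrac{t}{\sqrt{n}}\right)t^2 + \lim_n e(z)\left|k\left(\tfrac{t}{\sqrt{n}}\right)\right|t^2 = -\tfrac{1}{2}t^2$$

It therefore follows that

$$\varphi_{G_n}(t) \to e^{-t^2/2} \quad\text{for all } t$$

By Levy's Continuity Theorem, the distributions $\mu_n$ of $G_n$ converge weakly to a distribution whose characteristic function is $e^{-t^2/2}$. But this is the characteristic function of the standard normal distribution, so by Levy's Inversion Formula it follows that $G_n$ weakly converges to a standard normally distributed random variable.

⊣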
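
The statement $G_n \Rightarrow N$ can also be illustrated at the level of distribution functions. The sketch below assumes the coin-tossing variables of Exercise .1.1 ($X_k = \pm 1$ with probability $\frac{1}{2}$ each), for which the distribution of $S_n$ is an exactly computable shifted binomial; the names `cdf_G` and `cdf_normal` are illustrative only.

```python
import math

def cdf_G(n, x):
    """Exact P(G_n <= x) for G_n = (X_1 + ... + X_n) / sqrt(n), X_k = +1 or -1 with prob 1/2.

    If `heads` counts the +1 steps, then S_n = 2 * heads - n and heads ~ Binomial(n, 1/2).
    """
    favourable = 0
    for heads in range(n + 1):
        if (2 * heads - n) / math.sqrt(n) <= x:
            favourable += math.comb(n, heads)
    return favourable / 2 ** n

def cdf_normal(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, cdf_G(1000, x), cdf_normal(x))
```

Because $G_n$ lives on a lattice, the two columns agree only up to an error of order $1/\sqrt{n}$; the point is that the gap is already small at $n = 1000$.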






Appendix A 

Sets, Logic and Functions 



A.1 Logic and Formal Language

We introduce here a formal language for talking about mathematical objects. This lan- 
guage is very precise, and unambiguous — properties which are largely absent from spoken 
languages such as English, but obviously essential for mathematics. But, as a result, this 
language is rather restricted in scope. The reason we use it is to make certain statements 
amenable to logical analysis. The purpose of logical analysis is to decide whether a par- 
ticular sentence/expression (e.g. about mathematical objects) is true (T) or false (F). A 
sentence/expression that is either true or false (but not both!) is called a statement. 

Example A.1.1 Here are some typical examples of statements:

• $1 + 1 = 3$.

• All apples are red.

• The equation $x^2 + 2x + 1 = 0$ has a real root.

• Either $x^2 + a = 0$ has a real root, or $a > 0$.

• There exist infinitely many prime numbers.

• Every continuous function is differentiable. 
Note that a mathematical statement need not be true. 

□ 



Exercise A.1.2 Which of the following are statements? For each statement, try to decide whether it is true or false.

1. 1 + 1

2. 3 is greater than 0.

3. $\sqrt{2}$ is an irrational number.

4. $x^2 - 1 = 0$.

5. If $x = 1$, then $x^2 - 1 = 0$.

6. If $x^2 - 1 = 0$, then $x = 1$.

7. The moon is made of cheese.

8. The moon is a tasty snack.

9. If the moon is made of cheese, then the moon is a tasty snack.

10. The sentence $\varphi$ defined by

$\varphi$ = "The sentence $\varphi$ is false"

11. All unicorns are white.

12. All unicorns are pink.

□ 

More complicated statements in our formal language are built up from a collection of symbols, 
including amongst others 

• Symbols for objects and relations; 

• Logical Connectives; 

• Quantifiers; 

We will briefly discuss each of these in turn. None of this material is difficult, though it may 
take a little while to get used to. 

A.1.1 Symbols denoting Objects, Operations and Relations

When doing mathematics, we use symbols to denote certain mathematical objects, operations and relations. For example, the expression

$$x + 3 < \sqrt{\pi}$$

contains the following symbols:

(i) Symbols denoting fixed objects, namely the constants $3$ and $\pi$;

(ii) A symbol denoting a variable object, namely $x$;

(iii) Symbols denoting operations, namely $+$, $\sqrt{\ }$;

(iv) A symbol denoting a relationship, namely $<$;

So our language will contain symbols for

• Variables: Typically we use the symbols $x, y, z, x_1, x_2, x_3, \dots$

• Constants: e.g. $0, 1, 2, \dots$, or $\pi$, etc.

• Functions/Operations: $+, -, \sqrt{\ }, \cup, \cap$

• Properties and Relations: e.g. $=, <, >, \in, \subseteq$, etc.






A.1.2 Logical Connectives

Once we are able to make basic statements such as $1 > 0$ and $x = 3$, we are able to combine them using the logical connectives and, or, implies (then), not to make new statements such as

$(1 > 0)$ and $(x = 3)$; If $x > 0$ then $y = 1$; $x \ne 0$

$$\begin{array}{ll}
\wedge & \text{and} \\
\neg & \text{not} \\
\to & \text{implies, then} \\
\vee & \text{or} \\
\leftrightarrow & \text{if and only if}
\end{array}$$

In our formal language, these connectives have precise meanings: If $\varphi, \psi$ denote statements, then

$\varphi \wedge \psi$ is true $\iff$ both $\varphi, \psi$ are true.

$\varphi \vee \psi$ is true $\iff$ at least one of $\varphi, \psi$ is true, perhaps both.

$\varphi \to \psi$ is true $\iff$ whenever $\varphi$ is true, so is $\psi$, i.e. it is not the case that $\varphi$ is true but $\psi$ is false.

$\varphi \leftrightarrow \psi$ is true $\iff$ both $\varphi \to \psi$, $\psi \to \varphi$ are true, i.e. if $\varphi, \psi$ are simultaneously true, or when they are simultaneously false.

$\neg\varphi$ is true $\iff$ $\varphi$ is false.

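Here is a truth table for the logical connectives:

$$\begin{array}{cc|ccccc}
\varphi & \psi & \varphi \wedge \psi & \varphi \vee \psi & \varphi \to \psi & \varphi \leftrightarrow \psi & \neg\varphi \\
\hline
T & T & T & T & T & T & F \\
T & F & F & T & F & F & F \\
F & T & F & T & T & F & T \\
F & F & F & F & T & T & T
\end{array}$$

This means, for example, that if $\varphi$ is true and $\psi$ is false (the second row of the table) then $\varphi \wedge \psi$ is false, $\varphi \vee \psi$ is true, $\varphi \to \psi$ is false, etc.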
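
The truth table can be generated mechanically. The following is a minimal sketch in Python, assuming the usual encoding of T and F as `True` and `False`; the helper names `implies`, `iff` and `tf` are illustrative only.

```python
from itertools import product

def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

def iff(p, q):
    """Biconditional: true exactly when p and q have the same truth value."""
    return p == q

def tf(b):
    """Render a boolean as T or F."""
    return "T" if b else "F"

# One row per assignment of truth values, in the order TT, TF, FT, FF.
rows = []
for p, q in product([True, False], repeat=2):
    rows.append((p, q, p and q, p or q, implies(p, q), iff(p, q), not p))

print("phi psi and or  ->  <-> not-phi")
for row in rows:
    print("   ".join(tf(value) for value in row))
```

Each tuple in `rows` lists the truth values of $\varphi$, $\psi$, $\varphi \wedge \psi$, $\varphi \vee \psi$, $\varphi \to \psi$, $\varphi \leftrightarrow \psi$ and $\neg\varphi$, in that order.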

Now it is extremely important to note that the logical use of and $\wedge$, or $\vee$, and implies $\to$, though related to their common usage in English, is certainly not identical to it. In particular the truth value T or F of an expression such as $\varphi \wedge \psi$, $\varphi \vee \psi$, $\varphi \to \psi$ etc. depends only on the truth values of $\varphi$ and $\psi$, and not on any meaning that the statements $\varphi, \psi$ might possess! Let us discuss some of the pitfalls:



• And, $\wedge$:

$$\begin{array}{cc|c}
\varphi & \psi & \varphi \wedge \psi \\
\hline
T & T & T \\
T & F & F \\
F & T & F \\
F & F & F
\end{array}$$

To say that $\varphi \wedge \psi$ is true simply means that both $\varphi$ and $\psi$ are true. It does not assert any connection (causal or otherwise) between $\varphi$ and $\psi$. This is not typically true in English. With the English and, the following sentences have rather different meanings, but with the logical and they mean the same thing:

1. Alice got drunk and failed her test.

2. Alice failed her test and got drunk.



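• Or, $\vee$:

$$\begin{array}{cc|c}
\varphi & \psi & \varphi \vee \psi \\
\hline
T & T & T \\
T & F & T \\
F & T & T \\
F & F & F
\end{array}$$

$\varphi \vee \psi$ is true precisely when at least one of $\varphi, \psi$ is true, possibly both. In particular, it is not exclusive-or ("either. . . , or. . . "). Thus the statement

$$(1 > 0) \vee (5 \text{ is a prime number})$$

is true.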



• Implies, Then, $\to$:

$$\begin{array}{cc|c}
\varphi & \psi & \varphi \to \psi \\
\hline
T & T & T \\
T & F & F \\
F & T & T \\
F & F & T
\end{array}$$

The statement $\varphi \to \psi$ is true if whenever $\varphi$ is true, then so is $\psi$. In particular,

$\varphi \to \psi$ is false if and only if $\varphi$ is true but $\psi$ is false.

There are severe differences between the English usage and the mathematical usage of implies. In English usage, implies (or then) usually involves a causal connection, as in "If it is raining, then it is wet outside." It is wet because of the rain. But such a connection is irrelevant for the logical then. For example, the statement

$$(1 > 0) \to (5 \text{ is a prime number})$$

is true. Of course, the reason that 5 is prime is not because of the fact that $1 > 0$!! There is no causal connection.

We repeat: A logical $\varphi \to \psi$ statement is false only when $\varphi$ is true and $\psi$ is false; just look at the truth table.

- In particular, if $\psi$ is true, then $\varphi \to \psi$ is also true, no matter what $\varphi$ might be.

- Even more surprisingly, if $\varphi$ is false, then $\varphi \to \psi$ is true, i.e. a false statement implies any other statement. In particular

$$(0 = 1) \to (\text{The Moon is made of cheese})$$

is true.

Exercise A.1.3 (a) Two statements $P, Q$ are said to be logically equivalent, and we write this as $P \iff Q$, if and only if $P, Q$ have the same truth value. There is an algorithm to check if two statements are logically equivalent: Simply construct a truth table for $P, Q$ and show that the truth values for $P, Q$ are always the same.[1] Show that

$$\neg(\varphi \vee \psi) \iff (\neg\varphi) \wedge (\neg\psi)$$

that

$$\neg(\varphi \wedge \psi) \iff (\neg\varphi) \vee (\neg\psi)$$

and that

$$\varphi \to \psi \iff (\neg\varphi) \vee \psi$$

[Hint: For the first equivalence, construct the truth table

$$\begin{array}{cc|ccccc}
\varphi & \psi & \varphi \vee \psi & \neg(\varphi \vee \psi) & \neg\varphi & \neg\psi & (\neg\varphi) \wedge (\neg\psi) \\
\hline
T & T & & & & & \\
T & F & & & & & \\
F & T & & & & & \\
F & F & & & & &
\end{array}$$

The truth value entries in the $\neg(\varphi \vee \psi)$-column and the $(\neg\varphi) \wedge (\neg\psi)$-column should be identical (or else you've made a mistake). This means that $\neg(\varphi \vee \psi)$ is true precisely when $(\neg\varphi) \wedge (\neg\psi)$ is true, and hence that the statements are equivalent. Repeat for the other equivalences that must be shown.]

(b) Show that

$$(\varphi \to \psi) \iff (\neg\psi \to \neg\varphi)$$

This is important in proofs: To show that $\psi$ follows from $\varphi$ it is enough to show that if $\psi$ fails to be true, then $\varphi$ also fails to be true.



□ 
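
The truth-table algorithm for logical equivalence is easy to mechanize. The following is a sketch under the assumption that statements are represented as Python functions of boolean arguments; the names `equivalent` and `implies` are hypothetical helpers, not notation from the notes.

```python
from itertools import product

def equivalent(f, g, n_vars=2):
    """True iff f and g have the same truth value under every assignment."""
    return all(
        f(*values) == g(*values)
        for values in product([True, False], repeat=n_vars)
    )

def implies(p, q):
    """Material implication."""
    return (not p) or q

# De Morgan: not (p or q)  vs  (not p) and (not q)
de_morgan_or = equivalent(lambda p, q: not (p or q), lambda p, q: (not p) and (not q))

# De Morgan: not (p and q)  vs  (not p) or (not q)
de_morgan_and = equivalent(lambda p, q: not (p and q), lambda p, q: (not p) or (not q))

# Contrapositive: p -> q  vs  (not q) -> (not p)
contrapositive = equivalent(implies, lambda p, q: implies(not q, not p))

# Converse: p -> q  vs  q -> p  (these are NOT equivalent)
converse = equivalent(implies, lambda p, q: implies(q, p))

print(de_morgan_or, de_morgan_and, contrapositive, converse)
```

The first three checks should succeed while the converse check fails, which matches the familiar fact that a statement and its converse need not agree. Running the checker does not replace constructing the tables by hand, which is the point of the exercise.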



Exercise A.1.4 A proof by contradiction works as follows: To prove that a statement $P$ is true, it is enough to show that there is a known false statement $Q$ so that $\neg P \to Q$ is true, i.e. so that assuming that $P$ is not true leads to a false statement. We may then conclude that $P$ is true (for if $P$ were false, then $\neg P$ would be true, and since $\neg P \to Q$ is true, we may conclude that $Q$ is true, contradicting the fact that $Q$ is known to be false.)

Can you demonstrate the above reasoning using a truth table?
[Hint: Construct a truth table as above, and then remove rows that contradict what you know. You know $Q$ is false, so you can remove rows in which $Q$ is true. You know $\neg P \to Q$ is true, so ... ]

□ 



1 This method probably appears first in Ludwig Wittgenstein's Tractatus Logico-Philosophicus, but he undoubtedly cribbed the idea from Gottlob Frege's Begriffsschrift.






A.1.3 Quantifiers

Many mathematical statements assert the existence of a mathematical object with certain properties. For example to say that

$x^2 - 1 = 0$ has a real root

is to say that there exists a real number $c$ such that $c^2 - 1 = 0$.

Other mathematical statements assert that something is true for all objects (of a prespecified type), for example

For every real number $x$, $x^2 \ge 0$.

We therefore introduce the following symbols for quantifiers:

$$\begin{array}{ll}
\forall & \text{For all} \\
\exists & \text{There exists}
\end{array}$$

A quantifier always occurs in conjunction with a variable, i.e. as $\forall x$ or as $\exists x$. Thus if $\varphi(x)$ is a statement about $x$, then

$\forall x\, \varphi(x)$ is true iff the statement $\varphi(x)$ is true for every $x$

Frequently, if we want to restrict the domain to a particular set $X$, we may also write $\forall x \in X\, \varphi(x)$ or $\exists x \in X\, \varphi(x)$. Thus

$(\exists x \in X)\varphi(x)$ is true iff there is at least one $x \in X$ for which the statement $\varphi(x)$ is true

Thus the statement $\exists x \in \mathbb{R}\,(x^2 - 1 = 0)$ asserts that the equation $x^2 - 1 = 0$ has a real root.

The statement $\forall x \in \mathbb{R}\,(x^2 \ge 0)$ asserts that the square of any real number is non-negative.
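
On a finite domain the two quantifiers correspond directly to Python's built-in `any` and `all`. The sketch below assumes a small finite set of integers as a stand-in for the domain $X$ (the real line itself cannot be enumerated); the variable names are illustrative only.

```python
# A finite stand-in for the domain X (an assumption made for illustration).
domain = range(-3, 4)

# (exists x in X)(x^2 - 1 = 0): some element of the domain is a root.
exists_root = any(x * x - 1 == 0 for x in domain)

# (forall x in X)(x^2 >= 0): every element of the domain has a non-negative square.
all_squares_nonneg = all(x * x >= 0 for x in domain)

# (exists x in X)(x^2 = -1): no element of the domain qualifies.
exists_neg_square = any(x * x == -1 for x in domain)

print(exists_root, all_squares_nonneg, exists_neg_square)
```

A finite check of this kind can refute a universal statement or confirm an existential one, but it says nothing about elements outside the chosen domain.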
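
Exercise A.1.5 Decide if the following sentences about real numbers are true or false:

(a) $\exists x \in \mathbb{R}\,(x^2 = -1)$

(b) $\exists x \in \mathbb{N}\,(4x = 1)$

(c) $\exists x \in \mathbb{R}\,(4x = 1)$

(d) $\forall x \in \mathbb{R}\ \exists y \in \mathbb{R}\,(x < y)$

(e) $\exists y \in \mathbb{R}\ \forall x \in \mathbb{R}\,(x < y)$

(f) $\exists y \in [0,1]\ \forall x \in [0,1]\,(x < y)$

(g) $\forall x \in \mathbb{R}\ \forall y \in \mathbb{R}\,[xy = 0 \to (x = 0 \vee y = 0)]$

(h) $\forall x \in \mathbb{R}\ \forall y \in \mathbb{R}\ \exists z \in \mathbb{R}\,[x + z = y]$

(i) $\exists z \in \mathbb{R}\ \forall x \in \mathbb{R}\ \forall y \in \mathbb{R}\,[x + z = y]$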

□ 

Exercise A.1.6 Rewrite the following sentences about numbers using logical notation.

(a) The integer $x$ is an even number. [Hint: An integer $x$ is even if and only if there is an integer $y$ such that $x = 2y$]

(b) $x$ is an odd number.

(c) Any integer is either odd or even.

(d) For any positive integer, there is another integer so that their sum is negative.

(e) $x$ is a rational number.

(f) $\sqrt{2}$ is an irrational number.

□ 

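Note that we have the following equivalence of statements:

$$\neg\big(\forall x\, \varphi(x)\big) \iff \exists x\, \big(\neg\varphi(x)\big)$$

For if it isn't the case that the statement $\varphi(x)$ is true for every $x$, then there is at least one $x$ for which the statement $\varphi(x)$ is false, and thus for which $\neg\varphi(x)$ is true.

Exercise A.1.7 Explain why

$$\neg\big(\exists x\, \varphi(x)\big) \iff \forall x\, \big(\neg\varphi(x)\big)$$

□

Thus a negation sign can "creep" past a quantifier, but it flips the quantifier in the process. For example,

$$\neg[\forall x\, \exists y\,(y > x)] \iff \exists x\, \neg[\exists y\,(y > x)]$$
$$\iff \exists x\, \forall y\,(y \not> x)$$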
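
The rule that a negation flips a quantifier as it passes can be checked exhaustively on a small finite domain. The sketch below makes two assumptions for illustration: the domain is $\{0, 1, 2, 3, 4\}$ rather than the reals, and predicates are represented as tuples of booleans indexed by the domain.

```python
from itertools import product

domain = range(5)   # finite stand-in for the domain of quantification

# Every predicate on the domain is one assignment of True/False to its elements.
# Check: not (forall x phi(x))  <=>  exists x (not phi(x)), for all 2^5 predicates.
negation_rule_holds = all(
    (not all(phi[x] for x in domain)) == any(not phi[x] for x in domain)
    for phi in product([True, False], repeat=len(domain))
)

# The nested example: not [forall x exists y (y > x)]  vs  exists x forall y not (y > x).
nested_lhs = not all(any(y > x for y in domain) for x in domain)
nested_rhs = any(all(not (y > x) for y in domain) for x in domain)

print(negation_rule_holds, nested_lhs, nested_rhs)
```

Note that on this finite domain the statement $\forall x\, \exists y\,(y > x)$ is false (the largest element has nothing above it), whereas over the real numbers it is true; what the check illustrates is only that both sides of the equivalence always agree.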

One more thing: The variable $x$ in a statement of the form $\forall x\, \varphi(x)$ or $\exists x\, \varphi(x)$ is unimportant, i.e. the meaning of the statement remains the same if we change the variable (provided that the new variable does not already occur in the statement $\varphi$). This is just like what happens for definite integrals: For example, we have

$$\int_a^b f(x)\, dx = \int_a^b f(y)\, dy$$

Just so, we have

$$\forall x\, \varphi(x) \iff \forall y\, \varphi(y) \quad\text{and}\quad \exists x\, \varphi(x) \iff \exists y\, \varphi(y)$$

provided $y$ does not already occur in $\varphi$.






A.2 Sets, Functions and Relations

The philosophical debate about the nature of mathematical objects was given a boost when it became generally accepted (in the early 20th century) that, in principle, all mathematical objects "should" be sets and mathematical notions "should" be expressible as relationships between sets. This means, for example, that $\sqrt{2}$ is a set!! Actually, you mustn't take this too literally: What is meant is that set theory is flexible enough to interpret all mathematical objects as sets. We have mentioned before that the first satisfactory answers to the question "What is a real number?" were given independently, but nearly simultaneously (1872) by Cantor and Dedekind:

• Cantor: A real number $a$ is a certain set of sequences of rational numbers (namely the set of all such sequences that converge to $a$, but the definition can be phrased in such a way as to remove the circularity).

• Dedekind: A real number $a$ is a certain set of rational numbers (namely the set $\{x \in \mathbb{Q} : x < a\}$, but again the definition can be made non-circular).

The point is that both these approaches construct an object, a complete ordered field, that behaves just like the real numbers. The ingredients in the construction are simpler objects, namely rational numbers. Both approaches provide a concrete construction of an object that behaves just like the set of reals. And we do not really care what real numbers are, but only how they behave and interrelate. [In the same way, a chess player doesn't care what a chess piece is. Whether a piece is made of wood, or plastic, or appears on a computer screen is completely irrelevant. What matters to the chess player is how the piece behaves, i.e. how the rules (axioms) allow it to interact with other pieces on the board.]

In the same way, almost any other mathematical object can be interpreted as a set, in some way or another. For this reason, every mathematician needs just a little set theory. The material in this section is not difficult, and no doubt you have seen much of it before.



Intuitively, a set is just a collection of objects. 



If $A$ is a set and $x$ is some mathematical object, we say that

$x \in A$ ($x$ is an element of $A$)

if $x$ is amongst the objects collected in $A$, and we write

$x \notin A$

if it isn't.

The idea is that a set is characterized entirely by its elements. Thus if two sets A and B 
have exactly the same elements, then we must have A = B. For example, the sets A = {a} 
and B = {a, a} have the same elements, namely only a. Thus A = B. The fact that B seems 
to have two copies of a is immaterial. 

Here are some remarks for the philosophically minded: 

• Any definition involves some terms, and you can always ask for a definition of those terms. Those definitions will involve further terms, whose definition you can ask for. . . leading to either an infinite regress or circularity. We have to start somewhere, and we regard the notions of set and element as so basic that we need not define them: "We all know what is meant."

And indeed, the idea of forming sets of objects is basic: If you want to count some objects, you first 
have to decide which objects you want to count, i.e. you first have to (mentally) put those objects in a 
set before you can count them. Forming sets is even more basic, therefore, than counting!! 






It is absolutely remarkable that starting with just this undefined notion we can rigorously develop almost 
all mathematics, and certainly all applied mathematics. 

• Saying that two sets are equal if and only if they have the same elements means, for example, that 

{Evening Star} = {Morning Star} 

as both sets are equal to the {planet Venus}. Yet the Evening Star is seen only in the evening, whereas 
the Morning Star is seen only in the morning. . . 

Instead of set, we will also sometimes say class, collection or family; instead of saying x is 
an element of A we will sometimes say x is a member of A or x belongs to A. 
There are two ways to represent sets:

(i) By listing its elements, and

(ii) By some defining property.

For example, if a set $A$ has finitely many elements $a_1, \dots, a_n$ then it can be represented by $A = \{a_1, a_2, \dots, a_n\}$. On the other hand if $A$ is the set of all $x$ having a certain property $P(x)$, then $A$ can be denoted by $A = \{x : P(x)\}$.
In analysis, the following sets are important: 

• The set of natural numbers $\mathbb{N} = \{0, 1, 2, 3, \dots\}$

• The set of integers or whole numbers $\mathbb{Z} = \{\dots, -2, -1, 0, 1, 2, \dots\}$

• The set of rational numbers $\mathbb{Q} = \{\frac{n}{m} : n, m \in \mathbb{Z}, m \ne 0\}$

• The set of real numbers $\mathbb{R}$, and the set of non-negative real numbers is denoted by $\mathbb{R}^+$.

• The set of complex numbers $\mathbb{C} = \{a + ib : a, b \in \mathbb{R}\}$

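Example A.2.1 • The set $A$ of all integers between $-1$ and $3$ can be represented in two ways:

(i) $A = \{-1, 0, 1, 2, 3\}$

(ii) $A = \{n : n \text{ is an integer and } -1 \le n \le 3\}$

• $\mathbb{Q} = \{x \in \mathbb{R} : \exists n \in \mathbb{Z}\,(n \ne 0 \wedge nx \in \mathbb{Z})\}$

• $\{\sqrt{2}\} = \{x \in \mathbb{R} : x > 0 \wedge x^2 = 2\}$.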

□ 
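
Both ways of representing a set, and the fact that a set is determined entirely by its elements, have direct analogues in Python's built-in `set` type. This is only an illustrative sketch; the search range `range(-10, 11)` is an assumption standing in for "all integers".

```python
# (i) A set given by listing its elements.
A_listed = {-1, 0, 1, 2, 3}

# (ii) The same set given by a defining property, as a set comprehension.
# The finite range stands in for the integers (an assumption for illustration).
A_by_property = {n for n in range(-10, 11) if -1 <= n <= 3}

# Two sets with exactly the same elements are equal.
same_set = A_listed == A_by_property

# Listing an element twice does not change the set: {a} = {a, a}.
duplicates_ignored = {"a"} == {"a", "a"}

# Order of listing does not matter either.
order_ignored = {1, 2} == {2, 1}

print(same_set, duplicates_ignored, order_ignored)
```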

A set need not have any elements: 



Definition A.2.2 We define the empty set to be the set with no members, and denote it by the symbol $\emptyset$.



The empty set plays roughly the same role in set theory that the number zero plays in 
ordinary mathematics. 







Figure A.1: Venn diagram illustrating inclusion $A \subseteq B$.

Exercise A.2.3 In the above definition, we speak of "the" empty set. Explain why there is only one empty set, and not many. To be more concrete, note that both the sets $\{x : x \in \mathbb{R} \text{ and } x^2 < 0\}$ and $\{x : x \ne x\}$ have no elements. Explain why they are the same set.

□ 

Before we continue, please note the following, which is a common source of error:

$$A \ne \{A\}$$

e.g.

$$\emptyset \ne \{\emptyset\}$$

The set on the left has no elements, whereas the set on the right has one element, namely $\emptyset$.
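
The distinction between $\emptyset$ and $\{\emptyset\}$ can be made concrete in Python. The sketch below uses `frozenset`, because Python only allows hashable (immutable) sets to be elements of other sets; that choice is an implementation detail, not part of the mathematics.

```python
empty = frozenset()                       # the empty set
singleton_of_empty = frozenset({empty})   # the set whose only element is the empty set

print(len(empty))                    # number of elements of the empty set
print(len(singleton_of_empty))       # number of elements of {empty set}
print(empty == singleton_of_empty)   # the two sets are different
print(empty in singleton_of_empty)   # the empty set is an element of {empty set}
```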

Definition A.2.4 We say that a set $A$ is a subset of another set $B$, and write

$$A \subseteq B$$

if and only if every element of $A$ is also an element of $B$.

We say that $A$ is a proper subset of $B$ if $A$ is a subset of $B$, but $A \ne B$.

We may also write $B \supseteq A$ instead of $A \subseteq B$; they mean the same thing (just as $x \le y$ and $y \ge x$ mean the same thing).

Remarks A.2.5 Note that $A = B$ if and only if $A \subseteq B$ and $B \subseteq A$.

□ 

Exercises A.2.6 (1) List all the subsets of the set $A := \{0, 1, 2, \{1, 2\}\}$.

(2) Prove that $\emptyset$ is a subset of every set.

[Hint: Give a proof by contradiction. Assume that there is a set $A$ such that $\emptyset \not\subseteq A$.]

(3) Show formally that if $A \subseteq B$ and if $B \subseteq C$, then $A \subseteq C$.

(4) Prove (by induction or otherwise) that if a finite set $A$ has $n$ elements, then it has $2^n$ distinct subsets.

□ 



Figure A.2: Venn diagram illustrating intersection $A \cap B$.

Figure A.3: Venn diagram illustrating union $A \cup B$.

A.2.1 Operations on sets

There are several ways of combining sets to form new sets. In this section we define and 
give some examples of the set-operations union, intersection, difference, complementation, 
cartesian product and power set formation. 

Definition A.2.7 (Union, intersection and difference of two sets)
Suppose that $A, B$ are sets.

(a) The union of $A$ and $B$ is the set of all elements which are either in $A$ or in $B$ (or both).

$$A \cup B = \{x : x \in A \vee x \in B\}$$

(b) The intersection of $A$ and $B$ is the set of all elements which belong to both $A$ and $B$.

$$A \cap B = \{x : x \in A \wedge x \in B\}$$

(c) The set difference of $A$ and $B$ is the set of all elements which belong to $A$, but not to $B$.

$$A - B = \{x : x \in A \wedge x \notin B\}$$

Two sets $A, B$ are said to be disjoint if they have no members in common, i.e. if $A \cap B = \emptyset$. In that case, $A - B = A$, $B - A = B$.
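
The three operations just defined correspond to the operators `|`, `&` and `-` on Python's built-in sets. The particular sets below are arbitrary examples chosen for illustration.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {7, 8}

union = A | B          # elements in A or in B (or both)
intersection = A & B   # elements in both A and B
difference = A - B     # elements in A but not in B

# A and C have no members in common, so they are disjoint,
# and removing one from the other changes nothing.
disjoint = A.isdisjoint(C)
unchanged = (A - C == A) and (C - A == C)

print(union, intersection, difference, disjoint, unchanged)
```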

Often we work within some universe, which is just the set of all objects under consideration 
at that time. The sets that we deal with are then typically subsets of the universe. 
Which set is the universe depends very much on context. If one is dealing with real numbers, the obvious choice of universe is $\mathbb{R}$, but if one is dealing with complex numbers as well, then it would be $\mathbb{C}$. If one is trying to find the solution of an $n$th order differential equation, then the universe will generally be the set of all $n$-times differentiable functions. In probability theory, the sample space $\Omega$ acts as universe.



Figure A.4: Venn diagram illustrating set difference $A - B$.

Figure A.5: Venn diagram illustrating complementation $A^c$.



Given a universe, we also have a unary operation on sets, called complementation. 



Definition A.2.8 Let the universe be $\Omega$, and let $A \subseteq \Omega$. The complement of $A$ is the set of all elements in the universe which are not in $A$.

$$A^c = \{x \in \Omega : x \notin A\}$$

Note that $A^c = \Omega - A$. Also note that $A - B = A \cap B^c$.
Here are some standard identities involving the operations: 






Proposition A.2.9 Suppose that $A, B, C$ are subsets of some universe $\Omega$.

(a) Idempotent laws: $A \cup A = A$; $A \cap A = A$

(b) Commutative laws: $A \cup B = B \cup A$; $A \cap B = B \cap A$

(c) Associative laws: $(A \cup B) \cup C = A \cup (B \cup C)$; $(A \cap B) \cap C = A \cap (B \cap C)$

(d) Distributive laws: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$; $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$

(e) Absorption laws: $A \cup (A \cap B) = A$; $A \cap (A \cup B) = A$

(f) Complementation laws: $A \cup A^c = \Omega$; $A \cap A^c = \emptyset$; $(A^c)^c = A$

(g) De Morgan's laws: $(A \cap B)^c = A^c \cup B^c$; $(A \cup B)^c = A^c \cap B^c$

Note that each of the identities remains true if

• $\cap$ and $\cup$ are interchanged, and

• $\emptyset$ and $\Omega$ are interchanged.

Proof: We show how to prove one of the above laws, and leave the remainder as exercises. Let us prove that $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.

First suppose that $x \in A \cap (B \cup C)$. Then $x \in A$ and $x \in B \cup C$, by definition of $\cap$. Thus $x \in A$ and either (1) $x \in B$, or (2) $x \in C$ (or both), by definition of $\cup$. Thus either (1) $x \in A$ and $x \in B$, or (2) $x \in A$ and $x \in C$. It follows that either (1) $x \in A \cap B$ or (2) $x \in A \cap C$, and thus that $x \in (A \cap B) \cup (A \cap C)$. We have now shown that if $x \in A \cap (B \cup C)$, then also $x \in (A \cap B) \cup (A \cap C)$, i.e. that

$$A \cap (B \cup C) \subseteq (A \cap B) \cup (A \cap C) \qquad (*)$$

Next, assume that $x \in (A \cap B) \cup (A \cap C)$. Then either (1) $x \in A \cap B$ or (2) $x \in A \cap C$. In either case, it follows that $x \in A$. Also we must have either (1) $x \in B$, or (2) $x \in C$, and thus $x \in B \cup C$. We see, therefore, that we have both $x \in A$ and $x \in B \cup C$, so that $x \in A \cap (B \cup C)$. It follows that whenever $x \in (A \cap B) \cup (A \cap C)$, then also $x \in A \cap (B \cup C)$, i.e. that

$$(A \cap B) \cup (A \cap C) \subseteq A \cap (B \cup C) \qquad (\dagger)$$

Putting $(*)$ and $(\dagger)$ together, we obtain

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$

as required.

⊣
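
Identities of this kind can be sanity-checked by brute force over all subsets of a small universe. The sketch below assumes a four-element universe; the helper names `subsets` and `complement` are illustrative. Like a Venn diagram, such a check is evidence rather than proof, since it covers only one finite universe.

```python
from itertools import combinations

universe = frozenset(range(4))   # a small universe, chosen for illustration

def subsets(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def complement(a):
    """Complement of a relative to the universe."""
    return universe - a

all_subsets = subsets(universe)

distributive = all(
    a & (b | c) == (a & b) | (a & c)
    for a in all_subsets for b in all_subsets for c in all_subsets
)
absorption = all(
    a | (a & b) == a and a & (a | b) == a
    for a in all_subsets for b in all_subsets
)
de_morgan = all(
    complement(a & b) == complement(a) | complement(b)
    and complement(a | b) == complement(a) & complement(b)
    for a in all_subsets for b in all_subsets
)

print(distributive, absorption, de_morgan)
```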

Exercise A.2.10 Prove the remaining identities in the proposition above.

(By the way, drawing a Venn diagram does not constitute a proof! Venn diagrams are drawings in the plane, 
and are reliable only when you are dealing with quite a small number of sets.) 

□ 

A set is completely determined by its elements. The order in which those elements are arranged does not matter. For example, $\{a, b\} = \{b, a\}$. When we want the order to matter, we have to deal with ordered tuples. An ordered pair is denoted by $(a, b)$, and should be thought of as a collection containing $a$ and $b$, in that order. Thus $(a, b) \ne (b, a)$ (unless $a = b$). Note that

$$(a, b) = (c, d) \iff a = c \text{ and } b = d$$

Generally, an ordered $n$-tuple is denoted by $(a_1, a_2, \dots, a_n)$, and should be thought of as a collection containing $a_1, a_2, \dots, a_n$, in that order.

The pair $(a, b)$ is usually defined to be the set $\{\{a\}, \{a, b\}\}$. You can check that this definition yields the required property that $(a, b) = (c, d)$ iff $a = c$ and $b = d$.

$(a, b, c)$ is then defined to be $(a, (b, c))$ (which is just the set $\{\{a\}, \{a, \{\{b\}, \{b, c\}\}\}\}$), etc. This is in keeping with the notion that all mathematical objects should be sets. On first encounter, however, you might find this arbitrary, clumsy, and unnecessary, and you wouldn't be far wrong: The main thing that you need to keep in mind is that an ordered tuple is a collection in which the order matters.
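
The set-theoretic definition of the ordered pair as $\{\{a\}, \{a, b\}\}$ can be tried out directly. The sketch below uses `frozenset` so that sets can be nested, and checks the characteristic property of pairs on a small pool of values; the function name `pair` and the pool `values` are assumptions made for illustration.

```python
from itertools import product

def pair(a, b):
    """The ordered pair (a, b) encoded as the set {{a}, {a, b}}."""
    return frozenset({frozenset({a}), frozenset({a, b})})

values = [0, 1, 2]   # small pool of test values

# Check: pair(a, b) == pair(c, d)  iff  a == c and b == d, for all choices from the pool.
property_holds = all(
    (pair(a, b) == pair(c, d)) == (a == c and b == d)
    for a, b, c, d in product(values, repeat=4)
)

print(property_holds)
print(pair(1, 2) == pair(2, 1))   # order matters for pairs
print({1, 2} == {2, 1})           # but not for plain sets
```

Note that `pair(a, a)` collapses to the one-element set $\{\{a\}\}$, and the property still holds in that case.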

Using ordered tuples, we can define one more way of making new sets from old: 



Definition A.2.11 (Cartesian product) Suppose that $A_1, A_2, \dots, A_n$ are sets. The cartesian product of $A_1, \dots, A_n$ is the set of all $n$-tuples $(a_1, \dots, a_n)$, with each $a_k \in A_k$.

$$A_1 \times A_2 \times \dots \times A_n = \{(a_1, a_2, \dots, a_n) : a_k \in A_k \text{ for } k = 1, 2, \dots, n\}$$

We will identify the sets $(A \times B) \times C$ and $A \times (B \times C)$ with $A \times B \times C$, although, strictly speaking, they are not equal.

For example, $((a, b), c)$ is an element of the first set, but not of the second or third. $(a, (b, c))$ belongs to the second, but not to the first or third. $(a, b, c)$ belongs to the third, but not to the first two. However, we shall simply identify $(a, (b, c))$, $((a, b), c)$ and $(a, b, c)$, i.e. we shall not distinguish between them. After all, all that matters is the order of $a, b, c$ and that is the same in each of these tuples.
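
The difference between $(A \times B) \times C$, $A \times (B \times C)$ and $A \times B \times C$, and the harmless identification between them, can be seen with `itertools.product`. The sets `A`, `B`, `C` and the helpers `flatten_left` and `flatten_right` below are illustrative assumptions.

```python
from itertools import product

A = ["a1", "a2"]
B = [0, 1]
C = ["x"]

flat = set(product(A, B, C))                                     # A x B x C
left_nested = {(ab, c) for ab in product(A, B) for c in C}       # (A x B) x C
right_nested = {(a, bc) for a in A for bc in product(B, C)}      # A x (B x C)

def flatten_left(t):
    """Send ((a, b), c) to (a, b, c)."""
    (a, b), c = t
    return (a, b, c)

def flatten_right(t):
    """Send (a, (b, c)) to (a, b, c)."""
    a, (b, c) = t
    return (a, b, c)

print(left_nested == flat)                               # strictly speaking, not equal
print({flatten_left(t) for t in left_nested} == flat)    # but equal after identification
print({flatten_right(t) for t in right_nested} == flat)
```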

Thus far, we have considered union, intersection and cartesian product as binary opera- 
tions, involving just two sets. Frequently, however, we may need to consider these as infinitary 
operations: We can, for example, take the union of infinitely many sets. We define the union, 
intersection and cartesian product of a family of sets as follows: 






Definition A.2.12 (Union, intersection and product of a family of sets)
If $\mathcal{A} = \{A_i : i \in I\}$ is a family of sets, we may define

(a) the union

$$\bigcup \mathcal{A} = \bigcup_{i \in I} A_i = \{x : x \in A_i \text{ for some } i \in I\}$$

(b) the intersection

$$\bigcap \mathcal{A} = \bigcap_{i \in I} A_i = \{x : x \in A_i \text{ for all } i \in I\}$$

(c) the cartesian product

$$\prod \mathcal{A} = \prod_{i \in I} A_i = \{(a_i)_{i \in I} : a_i \in A_i \text{ for all } i \in I\}$$

Here $(a_i)_{i \in I}$ is a generalized tuple, indexed by $I$.

In essence, $(a_i)_I$ is a function with domain $I$ and range $\bigcup_{i \in I} A_i$. We will return to this later.

We will frequently write $\bigcup_I A_i$ or $\bigcup_i A_i$ instead of $\bigcup_{i \in I} A_i$. We will also write $\bigcup_n A_n$ instead of $\bigcup_{n=1}^{\infty} A_n$. The same holds for $\bigcap$ and $\prod$.

Remarks A.2.13 Note that

(i) $\bigcup \{A, B\} = A \cup B$

(ii) $\prod \{A, B, C\} = A \times B \times C$

(iii) $\bigcap \{X_1, X_2, \dots, X_n\} = X_1 \cap X_2 \cap \dots \cap X_n$

etc.

□ 
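
For a finite family the three operations can be computed directly. The sketch below assumes a family of three small sets of integers, held in a Python list; an infinite family would of course need a different representation.

```python
from itertools import product

# A finite family of sets (an illustrative choice).
family = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]

big_union = set().union(*family)              # elements lying in at least one member
big_intersection = set.intersection(*family)  # elements lying in every member
big_product = set(product(*family))           # all tuples (a_1, a_2, a_3) with a_i in A_i

print(big_union)
print(big_intersection)
print(len(big_product))

# For a two-member family, the family union is just the binary union.
A, B = {1, 2}, {2, 3}
print(set().union(*[A, B]) == A | B)
```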

Exercise A.2.14 Let $A_1, A_2, A_3, \dots, A_n, \dots$ be a sequence of subsets of a fixed set $\Omega$. For
$x \in \Omega$, we say that

$x \in A_n$ eventually (ev.)

if $x$ belongs to all the $A_n$ from some point onwards, i.e. if there exists an $N \in \mathbb{N}$ such that
$x \in A_n$ for all $n \geq N$. (Then $x$ belongs to all the $A_n$ from $N$ onwards.) Let $(A_n, \text{ev.})$ denote
the set of all $x$ such that $x$ belongs to $A_n$ eventually, i.e.

$(A_n, \text{ev.}) := \{x \in \Omega : x \in A_n, \text{ev.}\}$

Similarly, we say that

$x \in A_n$ infinitely often (i.o.)

if $x$ belongs to infinitely many of the sets $A_n$, or, more accurately, if there are infinitely many
$n \in \mathbb{N}$ such that $x \in A_n$. Let $(A_n, \text{i.o.})$ denote the set of all $x$ such that $x$ belongs to $A_n$
infinitely often, i.e.

$(A_n, \text{i.o.}) := \{x \in \Omega : x \in A_n, \text{i.o.}\}$






(a) Explain why $(A_n, \text{ev.}) \subseteq (A_n, \text{i.o.})$.

(b) Explain why the following is true:

$x \in A_n, \text{ev.} \iff \exists N \in \mathbb{N}\ \forall n \geq N\ (x \in A_n)$

(c) Explain why we may express $(A_n, \text{ev.})$ as follows:

$(A_n, \text{ev.}) = \bigcup_{N \in \mathbb{N}} \bigcap_{n \geq N} A_n$

[Hint: Try to understand the following reasoning: For $N \in \mathbb{N}$, define $B_N := \bigcap_{n \geq N} A_n =
A_N \cap A_{N+1} \cap A_{N+2} \cap \dots$. Then $x \in \bigcup_{N \in \mathbb{N}} \bigcap_{n \geq N} A_n$ iff $x \in \bigcup_{N \in \mathbb{N}} B_N$ iff there is some
$N$ such that $x \in B_N$. (Why?) Now $x \in B_N$ iff $x \in A_n$ for each $n \geq N$. (Why?)]

(d) Explain why the following is true:

$x \in A_n, \text{i.o.} \iff \forall N \in \mathbb{N}\ \exists n \geq N\ (x \in A_n)$

[Hint: If $x \in A_n$ i.o., then for each $N$, there must be $n \geq N$ such that $x \in A_n$. For if
this were not so, then there would be some $N$ such that $x \notin A_n$ for any $n \geq N$. But then
$x$ can belong only to those $A_n$ for $n \in \{1, 2, \dots, N - 1\}$, i.e. to only finitely many of the
$A_n$.]

(e) Explain why we may express $(A_n, \text{i.o.})$ as follows:

$(A_n, \text{i.o.}) = \bigcap_{N \in \mathbb{N}} \bigcup_{n \geq N} A_n$

(f) Explain why

$(A_n, \text{ev.})^c = (A_n^c, \text{i.o.})$ and $(A_n, \text{i.o.})^c = (A_n^c, \text{ev.})$

Do this in two ways: via logic, and via set theory.

□ 
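An infinite sequence of sets cannot be stored on a computer, but the two notions can still be explored in a special case. The sketch below assumes (this is our assumption, not part of the exercise) that from some index onwards the sequence $A_n$ simply repeats a finite list `cycle` of sets forever. Under that assumption $x \in A_n$ eventually exactly when $x$ lies in every set of the cycle, and $x \in A_n$ infinitely often exactly when $x$ lies in at least one of them. The function name `ev_and_io` is hypothetical.

```python
def ev_and_io(cycle):
    """Return ((A_n, ev.), (A_n, i.o.)) for a sequence of sets whose tail
    repeats the non-empty finite list `cycle` forever.

    Only the tail matters: finitely many initial sets change neither notion.
    """
    cycle = [set(s) for s in cycle]
    ev = set.intersection(*cycle)  # x is in every set of the tail
    io = set.union(*cycle)         # x is in some set of each period
    return ev, io


omega = set(range(6))
cycle = [{0, 1, 2}, {0, 2, 4}]     # A_n alternates between these two sets
ev, io = ev_and_io(cycle)
print(sorted(ev), sorted(io))      # [0, 2] [0, 1, 2, 4]

# One instance of the complement laws in part (f):
ev_c, io_c = ev_and_io([omega - s for s in cycle])
print(omega - ev == io_c, omega - io == ev_c)   # True True
```

A check of one instance is of course no substitute for the two proofs asked for in part (f).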

Here is another way of making new sets from old: Given a particular set, one should be 
able to collect all of its subsets together into a new set, called the power set. 

Definition A.2.15 (Power set)

If $A$ is a set, then the power set of $A$ is the set of all subsets of $A$:

$\mathcal{P}(A) = \{B : B \subseteq A\}$

Note that $\emptyset, A \in \mathcal{P}(A)$. They are, respectively, its smallest and biggest members.






A.2.2 Functions

Originally, a function was regarded as a rule (or a formula, or an algorithm) for associating
one real number with another. For example,

$f(x) = 2x^3$

explicitly shows how to calculate a number $f(x)$ which is to be associated with $x$: First cube
$x$, and then multiply the result by 2. However, this original formulation proved to be
unduly restrictive. For one thing, Fourier showed that practically any continuous curve of
finite length could be given a "formula" as an infinite trigonometric series. For another, we
may want to associate numbers with other mathematical objects, or one kind of mathematical
object with another; there is no reason to restrict ourselves solely to numbers.

For example, we may want to associate with each rectangle its area. Thus we have a function which assigns a 
number to each rectangle. 

Or, we may want to assign to each subset of R its power set. This yields a function which assigns a set to each 
set. 

Thus a general definition of function dispenses with the idea that it is a rule, but keeps 
the idea of associating one object with another: 

Definition A.2.16 Let $A$, $B$ be sets. A function (or map) $f$ from $A$ to $B$, written

$f : A \to B$ or $A \xrightarrow{f} B$

is a subset of the cartesian product $A \times B$ with the following property:

for each $a \in A$ there exists exactly one $b \in B$ such that $(a, b) \in f$

In that case write

$f(a) = b$ instead of $(a, b) \in f$

We call $b$ the image (or value) of $a$ under $f$, and call $a$ a preimage of $b$. We also say that
$a$ maps to $b$ under $f$.

The set $A$ is called the domain of $f$, and the set $B$ is called the codomain of $f$:

$A = \mathrm{dom}(f) \qquad B = \mathrm{codom}(f)$

The range of $f$ is the set of all possible values of $f$, and is denoted $\mathrm{ran}(f)$.

Essentially, this concept of function is arrived at by deliberately confusing a function with
its graph. For example, the graph of the function $f : \mathbb{R} \to \mathbb{R} : x \mapsto 2x^3$ is a curve in the
cartesian plane. This curve is therefore a set of ordered pairs:

$\mathrm{Graph}(f) = \{(x, y) : y = 2x^3\}$

For example, the points $(0, 0)$, $(1, 2)$, $(2, 16)$, $(3, 54)$ belong to the graph. Now we assert that
a function is its graph. Thus the function $f(x) = 2x^3$ is nothing but the set $\{(x, y) : y =
2x^3\} \subseteq \mathbb{R} \times \mathbb{R}$.
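For finite sets, the defining property of a function can be checked mechanically. The following Python sketch (the helper name `is_function` is ours, not part of the text) tests whether a set of ordered pairs is a function from $A$ to $B$ in the sense of Definition A.2.16, using a finite piece of the graph of $x \mapsto 2x^3$ as the example.

```python
def is_function(pairs, A, B):
    """True iff `pairs` is a subset of A x B that contains, for each a in A,
    exactly one pair of the form (a, b)."""
    if not all(a in A and b in B for (a, b) in pairs):
        return False  # not a subset of A x B
    return all(sum(1 for (x, _) in pairs if x == a) == 1 for a in A)


A = {0, 1, 2, 3}
B = set(range(60))
graph = {(x, 2 * x**3) for x in A}          # contains (0,0), (1,2), (2,16), (3,54)

print(is_function(graph, A, B))                 # True
print(is_function(graph - {(3, 54)}, A, B))     # False: 3 has no image
print(is_function(graph | {(1, 5)}, A, B))      # False: 1 has two images
```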






Examples A.2.17 You've already met more than just a few functions in your mathematical
education to date. The most obvious ones are functions from $\mathbb{R}^n$ to $\mathbb{R}^m$, such as $f(x) =
x^2$, $g(x, y) = \sin(x^3 + y)$, $h(x, y, z) = (xy, x \ln z)$, etc. Here are a few more that you might not
yet have considered as functions:

(a) Define $f : \mathbb{Z} \to \mathcal{P}(\mathbb{Z})$ by: $f(n) = \{m : m \text{ divides } n\}$. Then $f$ is a function which maps a
number to a set. For example,

$f(12) = \{\pm 1, \pm 2, \pm 3, \pm 4, \pm 6, \pm 12\} = f(-12)$

(b) Let $C^0(\mathbb{R}, \mathbb{R}) = \{f : f \text{ is a continuous map from } \mathbb{R} \text{ to } \mathbb{R}\}$, and let $a < b \in \mathbb{R}$. Then
$\int_a^b : C^0(\mathbb{R}, \mathbb{R}) \to \mathbb{R}$ is a function which assigns to every continuous map its definite
integral.

(c) Let $C^1(\mathbb{R}, \mathbb{R})$ be the set of all maps from $\mathbb{R}$ to $\mathbb{R}$ which have continuous first derivatives.
Then the derivative operator is a map $D : C^1(\mathbb{R}, \mathbb{R}) \to C^0(\mathbb{R}, \mathbb{R})$.

(d) curl is a map from the set of vector fields on $\mathbb{R}^3$ to itself. div is a map from the set of
vector fields on $\mathbb{R}^3$ to the set of functions $\mathbb{R}^3 \to \mathbb{R}$. grad is a map from the set of
differentiable functions $\mathbb{R}^3 \to \mathbb{R}$ to the set of vector fields on $\mathbb{R}^3$.

(e) An $n \times m$ matrix $A$ can be regarded as a map $A : \mathbb{R}^m \to \mathbb{R}^n$.

(f) Addition and multiplication are functions from $\mathbb{R}^2$ to $\mathbb{R}$. Addition can, in fact, be
described by the $1 \times 2$-matrix $(1 \ 1)$, for $(1 \ 1)\binom{a}{b} = a + b$.

(g) If $\Omega$ is a universal set, then union and intersection can be regarded as functions from
$\mathcal{P}(\Omega) \times \mathcal{P}(\Omega)$ to $\mathcal{P}(\Omega)$, which map the ordered pair $(A, B)$ to $A \cup B$ and $A \cap B$ respectively.

(h) We can also regard the bigger version $\bigcup$ of union as a map, but this time we have
$\bigcup : \mathcal{P}(\mathcal{P}(\Omega)) \to \mathcal{P}(\Omega)$. It assigns to any family of subsets of $\Omega$ its union. (Note that a
family of subsets of $\Omega$ is just a set of elements of $\mathcal{P}(\Omega)$, i.e. it is a subset of $\mathcal{P}(\Omega)$, and
therefore an element of $\mathcal{P}(\mathcal{P}(\Omega))$.) The same goes for intersection.

□ 

For any set $A$, there is an important function on $A$ called the identity function. It is
denoted by $\mathrm{id}_A$, and is defined by

$\mathrm{id}_A : A \to A \qquad \mathrm{id}_A(a) = a$

Thus $\mathrm{id}_A = \{(a, a) : a \in A\}$.

Examples A.2.18 (a) The identity function on $\mathbb{R}$ is just the function $y = x$.

(b) The identity function on $\mathbb{R}^n$ is the identity matrix

$\begin{pmatrix} 1 & & & 0 \\ & 1 & & \\ & & \ddots & \\ 0 & & & 1 \end{pmatrix}$



□ 

Definition A.2.19 Let $f : A \to B$. If $A' \subseteq A$, we can define the restriction of $f$ to $A'$ as
follows:

$f|A'$ is a map from $A'$ to $B$, such that $(f|A')(a) = f(a)$ for all $a \in A'$



Definition A.2.20 Let $f : A \to B$ be a function.

(a) $f$ is said to be one-to-one (or 1-1, or injective) if and only if the following condition
holds:

If $f(a_1) = f(a_2)$, then $a_1 = a_2$.

(b) $f$ is said to be onto (or surjective) if and only if

For every $b \in B$ there exists an $a \in A$ such that $f(a) = b$.

(c) $f$ is said to be a bijection (or a one-to-one correspondence) if it is both an injection
and a surjection.

Remarks A.2.21 A function $f : A \to B$ is injective if no two distinct members of $A$ map to
the same $b \in B$, i.e. if every $b \in B$ has at most one preimage.

$f$ is surjective if and only if every $b$ in $B$ gets mapped onto by some $a \in A$, i.e. if every $b \in B$
has at least one preimage. In that case $B$ is the range of $f$, i.e. $\mathrm{ran}(f) = \mathrm{codom}(f)$.

$f$ is a bijection if and only if every $b \in B$ has exactly one preimage.

It should be clear that there is a bijection from a finite set $A$ to another set $B$ if and only
if $A$ and $B$ have the same number of elements.

□ 
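For a function on a finite set, the three properties can be tested by counting preimages, exactly as in the remarks above. Here is a minimal Python sketch in which a function is represented as a dictionary mapping each element of the domain to its value; this representation and the helper names are our own choices for illustration.

```python
def is_injective(f):
    """No two distinct arguments share a value."""
    values = list(f.values())
    return len(values) == len(set(values))


def is_surjective(f, codomain):
    """Every element of the codomain is a value, i.e. ran(f) = codom(f)."""
    return set(f.values()) == set(codomain)


def is_bijective(f, codomain):
    return is_injective(f) and is_surjective(f, codomain)


# x -> x^2 on {-2, ..., 2}: not injective, since (-1)^2 = 1^2
square = {x: x * x for x in range(-2, 3)}
print(is_injective(square))                 # False
print(is_surjective(square, {0, 1, 4}))     # True
print(is_surjective(square, range(-4, 5)))  # False: -1 is not a value

# restricting to the non-negative arguments gives a bijection onto {0, 1, 4}
restricted = {x: x * x for x in range(0, 3)}
print(is_bijective(restricted, {0, 1, 4}))  # True
```

Note how surjectivity depends on the chosen codomain, whereas injectivity does not.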

Examples A.2.22 (a) Let $f(x) = x^2$. We would generally regard $f$ as a function with
domain $\mathbb{R}$ and codomain $\mathbb{R}$. The range of $f$ is $[0, +\infty)$, since $f$ takes no negative values.
$f$ is not injective, because, for example, $f(1) = f(-1)$. $f$ is not surjective either, since $-1$
is not in the range of $f$.

(b) If we define $g : [0, 1] \to [0, 1]$ by $g(x) = x^2$, then we may regard $g$ as the restriction
of $f$ to $[0, 1]$, i.e. $g = f|[0, 1]$. Now $g$ is clearly a bijection.

(c) $x^3 : \mathbb{R} \to \mathbb{R}$ is a bijection.

(d) Let $\mathbb{Q}^+$ denote the set of all non-negative rational numbers. The map $h : \mathbb{Z} \times \mathbb{N} \to \mathbb{Q}^+$
defined by $h(n, m) = \frac{|n|}{m}$ is surjective, but not injective.

(e) If $A \subseteq B$, then the inclusion $f : A \to B$ defined by: $f(a) = a$ is an injection. It is a
bijection if and only if $A = B$.

(f) Let $A$ be an $n \times n$-matrix, regarded as a map from $\mathbb{R}^n$ to $\mathbb{R}^n$. Then $A$ is injective if and
only if $\det(A) \neq 0$.

□ 

Next, we discuss how functions can be combined: 






Definition A.2.23 If $f : A \to B$ and $g : B \to C$, then $g \circ f$ is a function from $A$ to $C$,
defined by

$(g \circ f)(a) = g(f(a))$

Note that the composition $g \circ f$ does in one step what $f$ and $g$ do in two:

$A \xrightarrow{f} B \xrightarrow{g} C \qquad a \mapsto f(a) \mapsto g(f(a))$

$A \xrightarrow{g \circ f} C \qquad a \mapsto g(f(a))$

Also note that $g \circ f$ means:

Do $f$ first, then $g$

i.e. the last shall be first.

An often used fact is that composition is an associative operation on functions, i.e.

$h \circ (g \circ f) = (h \circ g) \circ f$

By this equation we mean that: one side is defined if and only if the other side is defined,
and in that case they are equal.

For if $f : A \to B$, $g : B \to C$, and $h : C \to D$, then $h \circ (g \circ f)$ is a function from $A$ to $D$ which
works as follows: First do $g \circ f$, then do $h$. But to do $g \circ f$, you must first do $f$, then $g$. The
combined result is

First do $f$, then $g$, and then $h$: $(h \circ (g \circ f))(a) = h(g(f(a)))$

Similarly, $(h \circ g) \circ f$ is a function from $A$ to $D$ which works as follows: First do $f$, then $h \circ g$.
But to do $h \circ g$, you must first do $g$, then $h$. The combined result is therefore

First do $f$, then $g$, and then $h$: $((h \circ g) \circ f)(a) = h(g(f(a)))$

and thus $h \circ (g \circ f) = (h \circ g) \circ f$, as claimed.

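Example A.2.24 Consider the following functions (note their domains and codomains):

$f : \mathbb{R} \to \mathbb{R}^+ : x \mapsto x^2 + 1$

$g : \mathbb{R}^+ \to \mathbb{R}^+ : y \mapsto \sqrt{y}$

$h : \mathbb{R}^+ \to [-1, 1] : z \mapsto \sin(z)$

Then

$g \circ f : \mathbb{R} \to \mathbb{R}^+ : x \mapsto \sqrt{x^2 + 1}$

$h \circ g : \mathbb{R}^+ \to [-1, 1] : y \mapsto \sin(\sqrt{y})$

and thus

$h \circ (g \circ f) : \mathbb{R} \to [-1, 1] : x \mapsto \sin(\sqrt{x^2 + 1})$

$(h \circ g) \circ f : \mathbb{R} \to [-1, 1] : x \mapsto \sin(\sqrt{x^2 + 1})$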

□ 
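The associativity of composition can also be observed numerically. The sketch below is only an illustration: `compose` is a helper name of our own, and the three functions are taken to be $x \mapsto x^2 + 1$, the square root, and the sine, as suggested by the example above.

```python
import math


def compose(g, f):
    """Return g o f, i.e. the map a -> g(f(a)): do f first, then g."""
    return lambda a: g(f(a))


f = lambda x: x**2 + 1   # R -> R+
g = math.sqrt            # R+ -> R+
h = math.sin             # R+ -> [-1, 1]

left = compose(h, compose(g, f))    # h o (g o f)
right = compose(compose(h, g), f)   # (h o g) o f

samples = [-2.0, 0.0, 0.5, 3.0]
print(all(left(x) == right(x) for x in samples))                       # True
print(all(left(x) == math.sin(math.sqrt(x**2 + 1)) for x in samples))  # True
```

Both bracketings evaluate the same chain of operations on each input, which is exactly the content of the argument: first $f$, then $g$, then $h$.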






Exercises A.2.25 (1) Let $f : \mathbb{N} \to \mathbb{N} : n \mapsto n^2$, and let $g : \mathbb{N} \to \mathbb{N} : n \mapsto n + 2$. Calculate
$(f \circ g)(5)$ and $(g \circ f)(5)$.
Write down formulas for $f \circ g$ and $g \circ f$.

(2) Suppose that $f(x) = x^2$ and $g(x) = x + 3$. Calculate $(g \circ f)(x)$ and $(f \circ g)(x)$. Note that
$g \circ f \neq f \circ g$.

(3) If $A$ is an $n \times m$-matrix, and $B$ is an $m \times r$-matrix, then we can regard them as functions
$\mathbb{R}^m \xrightarrow{A} \mathbb{R}^n$, $\mathbb{R}^r \xrightarrow{B} \mathbb{R}^m$. The composition $A \circ B$ is therefore a map $\mathbb{R}^r \to \mathbb{R}^n$. It is not hard
to show that the composition is just the matrix product, i.e. that $A \circ B = AB$. Do so!

(4) Suppose that $g \circ f_1 = g \circ f_2$. Prove that if $g$ is injective then we can "cancel" $g$ to conclude
$f_1 = f_2$. Give an example to show that left-cancellation may fail if $g$ is not injective.

(5) Suppose that $g_1 \circ f = g_2 \circ f$. Prove that if $f$ is surjective then we can "cancel" $f$ to obtain
$g_1 = g_2$. Show that right-cancellation may fail if $f$ is not surjective.

□ 



Note that if $f : A \to B$, then $f \circ \mathrm{id}_A = f$, and $\mathrm{id}_B \circ f = f$. Thus the identity function
behaves like an identity element for the operation of composition.

The number 0 is an identity element for the operation of addition, because $x + 0 = x$.
The number 1 is an identity element for the operation of multiplication, because $x \cdot 1 = x$.

Next, we tackle the idea of inverting (or reversing) the effect of a function. Take the
function $f(x) = 3x$. It transforms the number $x$ into the number $3x$. To undo this transformation,
you just multiply $3x$ by $\frac{1}{3}$. The function $g(x) = \frac{1}{3}x$ inverts the effect of $f$, in that

$(g \circ f)(x) = x \qquad (f \circ g)(y) = y$

Thus applying first $f$, and then $g$ gets you back to the starting point $x$. The same holds true
if you apply $g$ first, and then $f$.

Can every function be inverted? No, as is easy to see: Consider the function $f(x) = x^2$.
Then $f(2) = 4 = f(-2)$. Now if $g$ is a function which reverses the effect of $f$, then we cannot
decide whether $g(4) = 2$ or $g(4) = -2$. The problem arises because $f$ is not 1-1.

Let's make the preceding discussion precise: 

Definition A.2.26 Let $f : A \to B$. We say that $f$ is invertible if and only if there is a
function $g : B \to A$ such that

$g(f(a)) = a$ for all $a \in A$, $\quad f(g(b)) = b$ for all $b \in B$ $\quad (*)$

The function $g$, if it exists, is called the inverse of $f$, and denoted $g = f^{-1}$. Then $(*)$
amounts to saying

$f^{-1} \circ f = \mathrm{id}_A$ and $f \circ f^{-1} = \mathrm{id}_B$

Note that if $f^{-1}$ exists, then

$f^{-1}(b) = a$ if and only if $f(a) = b$

Proposition A.2.27 A function $f : A \to B$ is invertible if and only if it is a bijection.






Proof: Suppose that $f$ is invertible, i.e. that $f^{-1}$ exists. Then $f^{-1}$ is a function from $B$ to
$A$. We first show that $f$ is surjective: Let $b \in B$. Since the domain of $f^{-1}$ is $B$, $f^{-1}(b)$ must be
defined, i.e. there must be some $a \in A$ such that $f^{-1}(b) = a$. But then $f(a) = b$. Hence every
$b \in B$ has a preimage.

Next we show that $f$ is injective. For suppose that $f(a_1) = f(a_2) = b$. Then $f^{-1}(b) = a_1$ and
$f^{-1}(b) = a_2$. Since $f^{-1}$ is a function, we must have $a_1 = a_2$ (check the definition of function),
and hence $f$ is injective.

This proves that if $f$ is invertible, then $f$ is a bijection.

Now we prove the converse. If $f$ is a bijection, then it is onto $B$. Hence for every $b \in B$
there is some $a \in A$ such that $f(a) = b$. Moreover, since $f$ is one-to-one, that $a$ has to be
unique. So we may define $f^{-1}(b)$ to be the unique $a$ such that $f(a) = b$. This makes $f^{-1}$ into
a well-defined function $f^{-1} : B \to A$.

□
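The second half of the proof is constructive, and for finite sets it translates directly into code. The sketch below (with a hypothetical helper `inverse`, and functions again represented as dictionaries) builds $f^{-1}$ by swapping each pair $(a, b)$ to $(b, a)$, and reports which half of bijectivity fails when no inverse exists.

```python
def inverse(f, codomain):
    """Return the inverse of f (a dict from the domain to `codomain`) as a dict,
    or raise ValueError if f is not a bijection onto `codomain`."""
    inv = {}
    for a, b in f.items():
        if b in inv:
            # b would need two values under the inverse, so none exists
            raise ValueError(f"not injective: {inv[b]!r} and {a!r} both map to {b!r}")
        inv[b] = a
    if set(inv) != set(codomain):
        raise ValueError("not surjective: some b has no preimage")
    return inv


f = {1: "x", 2: "y", 3: "z"}
f_inv = inverse(f, {"x", "y", "z"})
print(f_inv == {"x": 1, "y": 2, "z": 3})          # True
print(all(f_inv[f[a]] == a for a in f))           # f^{-1} o f = id_A  -> True
print(all(f[f_inv[b]] == b for b in f_inv))       # f o f^{-1} = id_B  -> True
```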

Examples A.2.28 (a) The function $f(x) = x^3$ is a bijection on the reals, and its inverse is
$g(x) = \sqrt[3]{x}$.

(b) The function $f(x) = x^2$ does not have an inverse, since it is not a bijection. However, if
we restrict $f$ to the non-negative reals (as both domain and codomain), then it is a bijection.
Its inverse is the square root function.

(c) The function $f : \mathbb{R} \to (0, +\infty)$ defined by $f(x) = e^x$ is bijective. Its inverse is the
natural logarithm $\ln x$.

(d) The function $\sin x$ is neither injective, nor surjective; however, if we restrict $\sin x$ and
regard it as a function $[-\frac{\pi}{2}, \frac{\pi}{2}] \to [-1, 1]$, then it is a bijection, and its inverse is $\arcsin x$.

(e) If $A$ is an $n \times n$-matrix, regarded as a function on $\mathbb{R}^n$, then $A$ has an inverse function if
and only if $A$ has an inverse matrix. Since composition is just matrix multiplication, the
inverse function of $A$ is just the inverse matrix $A^{-1}$.

□ 

Remarks A.2.29 Note that, in general, $f^{-1}(x) \neq (f(x))^{-1}$,
e.g. $\sqrt[3]{x} \neq \frac{1}{x^3}$.

The number $x^{-1} = \frac{1}{x}$ is the inverse of $x$ under the operation of multiplication, in that

$x \cdot x^{-1} = 1 \qquad x^{-1} \cdot x = 1$

noting that 1 is the identity for multiplication.

The function $f^{-1}$ is the inverse of $f$ under the operation of composition, in that

$f \circ f^{-1} = \mathrm{id} \qquad f^{-1} \circ f = \mathrm{id}$

noting that id is the identity for composition.

The same notation for inverse, i.e. $^{-1}$, refers to different operations, so there's no reason to believe that there
is any relationship between them.

□ 






The notion of invertibility can be refined:

Definition A.2.30 Let $f : A \to B$ and $g : B \to A$.

(a) $g$ is called a left inverse of $f$ if $g \circ f = \mathrm{id}_A$.

(b) $g$ is called a right inverse of $f$ if $f \circ g = \mathrm{id}_B$.

Note that if $f$ is invertible, then $f^{-1}$ is both a left and a right inverse of $f$, and vice versa.

Exercises A.2.31 (1) Prove that a function $f$ has a left inverse if and only if it is injective.

(2) Prove that a function $f$ has a right inverse if and only if it is surjective.

(3) Prove that if a function $f$ has a left inverse $g$ and a right inverse $h$, then $f$ is invertible,
and $g = h$.

(4) Consider $f : \{a, b, c\} \to \{1, 2\}$ defined by $f(a) = f(b) = 1$, $f(c) = 2$. Find two distinct
right inverses of $f$.

(5) Consider the inclusion $\iota : \mathbb{Z} \to \mathbb{Q}$. Construct two distinct left inverses of $\iota$.

□ 

A.2.3 Functions Operating On Sets

We have already noted the confusion that may arise from the two uses of the symbol
$^{-1}$. We have but few symbols at our disposal, and many of them must therefore serve more
than one function. Thus you must always be aware of the context in which a particular symbol
is used.

You have to do this when using ordinary language: You know in what sense the newspaper headline 

"School kids make great snacks at fund raiser" 
is meant, even though the other sense offers greater amusement value. 

I say this because we are about to add to the possible confusion. With every function
$f : A \to B$ (not necessarily invertible), we can associate two new functions between the power
sets of $A$ and $B$:

$f[\cdot] : \mathcal{P}(A) \to \mathcal{P}(B) : A' \mapsto \{b \in B : \text{there is } a' \in A' \text{ such that } f(a') = b\}$ where $A' \subseteq A$

$f^{-1}[\cdot] : \mathcal{P}(B) \to \mathcal{P}(A) : B' \mapsto \{a \in A : f(a) \in B'\}$ where $B' \subseteq B$

Thus $f[\cdot]$ assigns to each subset $A'$ of $A$ a subset $f[A'] \subseteq B$. Similarly, $f^{-1}[\cdot]$ transforms each
subset $B'$ of $B$ into a subset $f^{-1}[B'] \subseteq A$.

We will, for the moment, use square brackets to distinguish the various functions, but will
drop this convention later. Which function is meant will be clear from context. We shall also
call $f[A']$ the direct image of $A'$ along $f$, and $f^{-1}[B']$ the inverse image of $B'$ along $f$. Note
that

$f[A']$ = set of all images of $a \in A'$

whereas

$f^{-1}[B']$ = set of all preimages of $b \in B'$






Remarks A.2.32 Sometimes the notation $f_*$ is used for direct image, and $f^*$ for inverse image.

□ 

Inverse images play a very important role in mathematics. It is therefore useful to
remember the following:

$a \in f^{-1}[B']$ if and only if $f(a) \in B'$

Similarly,

$b \in f[A']$ if and only if there is $a' \in A'$ such that $f(a') = b$

Examples A.2.33 (a) Suppose that $f : \mathbb{R} \to \mathbb{R} : x \mapsto x^2$. Then

$f[-1, 2] = [0, 4], \quad f[\mathbb{Z}] = \{0, 1, 4, 9, \dots\}, \quad f[\{4\}] = \{16\}$

Also

$f^{-1}[0, 1] = [-1, 1], \quad f^{-1}[\{4\}] = \{2, -2\}, \quad f^{-1}[\{-4\}] = \emptyset$

In each case, a set is transformed into a set.

(b) Suppose that $A = \{a_1, a_2, a_3\}$, $B = \{b_1, b_2, b_3\}$, and that $f : A \to B$ is defined by
$f(a_1) = f(a_3) = b_1$, and $f(a_2) = b_3$. Then

$f[\{a_1\}] = f[\{a_3\}] = f[\{a_1, a_3\}] = \{b_1\}, \quad f[\{a_2\}] = \{b_3\}, \quad f[A] = \{b_1, b_3\}, \quad f[\emptyset] = \emptyset$

and

$f^{-1}[\{b_3\}] = \{a_2\}, \quad f^{-1}[\{b_2\}] = f^{-1}[\emptyset] = \emptyset, \quad f^{-1}[B] = f^{-1}[\{b_1, b_3\}] = A$

□ 
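Direct and inverse images of finite sets are easy to compute, which makes it possible to re-check part (b) of the examples above. In the following Python sketch the function $f$ is stored as a dictionary and the strings `"a1"`, `"b1"`, etc. stand for $a_1$, $b_1$, etc.; the helper names `image` and `preimage` are our own.

```python
def image(f, subset):
    """Direct image f[A']: the set of all f(a') with a' in A'."""
    return {f[a] for a in subset}


def preimage(f, subset):
    """Inverse image f^{-1}[B']: the set of all a in the domain with f(a) in B'."""
    return {a for a in f if f[a] in subset}


f = {"a1": "b1", "a2": "b3", "a3": "b1"}
A = set(f)
B = {"b1", "b2", "b3"}

print(image(f, {"a1", "a3"}) == {"b1"})      # True
print(image(f, A) == {"b1", "b3"})           # True
print(preimage(f, {"b3"}) == {"a2"})         # True
print(preimage(f, {"b2"}) == set())          # True
print(preimage(f, B) == A)                   # True
```

Note that `preimage` makes sense whether or not $f$ is invertible: it never evaluates an inverse function, it only tests the condition $f(a) \in B'$.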

Exercises A.2.34 1. Let $f : A \to B$ be a function, and let $A' \subseteq A$, $B' \subseteq B$.

(a) Show that $A' \subseteq f^{-1}[f[A']]$

(b) Show that $B' \supseteq f[f^{-1}[B']]$

(c) Show that $A' = f^{-1}[f[A']]$ for every $A'$ if and only if $f$ is injective.

(d) Show that $B' = f[f^{-1}[B']]$ for every $B'$ if and only if $f$ is surjective.

[Hints: Reason along the following lines:

(b) If $b \in f[f^{-1}[B']]$ then $b = f(a)$ for some $a \in f^{-1}[B']$. But then $f(a) \in B'$, and so
$b \in B'$.

(c) If $a \in f^{-1}[f[A']]$, then $f(a) \in f[A']$. Thus there is $a' \in A'$ such that $f(a) = f(a')$. But
since $f$ is injective, $a = a'$, and so $a \in A'$.]

2. Inverse images preserve the set operations: Let $f : A \to B$, and suppose that $G$, $H$ are
subsets of $B$. Then

(a) If $G \subseteq H$ then $f^{-1}[G] \subseteq f^{-1}[H]$;

(b) $f^{-1}[G \cap H] = f^{-1}[G] \cap f^{-1}[H]$;

(c) $f^{-1}[G \cup H] = f^{-1}[G] \cup f^{-1}[H]$;

(d) $f^{-1}[G - H] = f^{-1}[G] - f^{-1}[H]$.






3. Direct images are not quite so well behaved: Let $f : A \to B$, and suppose that $G, H \subseteq A$.

(a) Suppose that $G \subseteq H$. Show that $f[G] \subseteq f[H]$;

(b) Show that $f[G \cup H] = f[G] \cup f[H]$;

(c) Show that $f[G \cap H] \subseteq f[G] \cap f[H]$;

(d) Give an example to show that we may not have $f[G \cap H] = f[G] \cap f[H]$;

(e) Show that $f[G] - f[H] \subseteq f[G - H] \subseteq f[G]$;

(f) Give an example to show, in (e), that both $\subseteq$'s may fail to be $=$'s.

□ 

We end this section with some notation: Suppose that $A$, $B$ are finite sets, and that $A$
has $n$ elements, and $B$ has $m$ elements. How many functions are there from $A$ to $B$?
For each $a \in A$ we have $m$ choices for the value $f(a) \in B$. Thus there are $m^n$ functions from
$A$ to $B$. For that reason we adopt the following notation:

Definition A.2.35 Let $A$, $B$ be sets. Then we define

$B^A$ = set of all functions from $A$ to $B$

Some authors use ${}^A B$ instead of $B^A$.

□

Note that each function $f : A \to B$ is a subset of $A \times B$. Hence $B^A$ is a set of subsets of
$A \times B$, i.e. $B^A \in \mathcal{P}(\mathcal{P}(A \times B))$.
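The counting argument behind the notation $B^A$ can be confirmed by brute force for small sets. The following Python sketch enumerates every function from $A$ to $B$ as a dictionary; the name `all_functions` is ours, and we assume the elements of each set can be sorted so that the enumeration order is reproducible.

```python
from itertools import product


def all_functions(A, B):
    """Yield every function from A to B as a dict.

    Each function corresponds to one choice of a value in B for each a in A,
    i.e. to one tuple in B x B x ... x B (one factor per element of A).
    """
    A = sorted(A)
    B = sorted(B)
    for values in product(B, repeat=len(A)):
        yield dict(zip(A, values))


A = {1, 2, 3}          # n = 3 elements
B = {"x", "y"}         # m = 2 elements
functions = list(all_functions(A, B))
print(len(functions))  # 8, i.e. m**n = 2**3
```

The enumeration also makes visible that a function from $A$ to $B$ is essentially a generalized tuple indexed by $A$ with entries in $B$.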

A.2.4 Relations

We want to capture mathematically the idea that two objects are somehow related. For
example, suppose that we have two sets

$M = \{\text{Archie}, \text{Reggie}, \text{Forsythe}\} \qquad W = \{\text{Betty}, \text{Veronica}, \text{Ethel}\}$

and suppose that Archie is married to Betty, and that Reggie is married to Veronica, but that
Forsythe and Ethel remain unmarried. The relation of being married is described by the set

$R = \{(\text{Archie}, \text{Betty}), (\text{Reggie}, \text{Veronica})\}$

Note that $R$ is a subset of the cartesian product $M \times W$. We will sometimes write $xRy$
instead of $(x, y) \in R$. Thus in this case, $xRy$ if and only if $x$ is married to $y$.
As for functions, the general definition of a relation is quite abstract:



Definition A.2.36 A relation from a set $A$ to a set $B$ is just a subset $R$ of $A \times B$. If $A = B$,
we just say that $R$ is a relation on $A$.






Thus if $A = \mathbb{N}$ and $B = \mathbb{N} \cup \{0\}$, then

$L = \{(1, 3), (2, 1), (3, 4), (4, 1), (5, 5), (6, 9), \dots\} \subseteq A \times B$

is a relation from $A$ to $B$. Here what the relation actually is may not be obvious. Could
you have guessed that $7L2$ and $8L6$? In fact, $nLm$ if and only if $m$ is the $n$-th digit in the
decimal expansion of $\pi = 3.14159265\ldots$. Since there may often be a relation without you
being able to see it, we have adopted a completely general definition of relation, which does
not assume any visible relationship between the objects.

Note that a function is a special kind of relation: We defined a function $f : A \to B$ to be a
subset of $A \times B$ with the additional property that for every $a \in A$ there is exactly one $b \in B$ such that
$a f b$. Instead of writing $a f b$, however, we write $f(a) = b$.

There are two important classes of relations in mathematics, namely
equivalence relations and partial orderings. Equivalence relations have many of the same
properties as $=$, and partial orderings have similar properties to $\leq$ and $\subseteq$.



Definition A.2.37 Suppose that $R$ is a relation on a set $A$. $R$ is said to be

(i) reflexive if $aRa$ for all $a \in A$.

(ii) symmetric if $aRb$ implies $bRa$ for all $a, b \in A$.

(iii) antisymmetric if $aRb$ and $bRa$ together imply $a = b$, for all $a, b \in A$.

(iv) transitive if $aRb$ and $bRc$ together imply $aRc$, for all $a, b, c \in A$.

An equivalence relation is a reflexive, symmetric, transitive relation.
A partial ordering is a reflexive, antisymmetric, transitive relation.



Let's take a look at equivalence relations from another angle: They are very closely related 
to partitions. 



Definition A.2.38 Let $A$ be a set. A family $\mathcal{A} = \{A_i : i \in I\}$ is called a partition of $A$
provided that

(i) The $A_i$ are mutually disjoint, i.e. $A_i \cap A_j = \emptyset$ for all $i, j \in I$ with $i \neq j$.

(ii) $\bigcup_{i \in I} A_i = A$

Thus $\{A_i : i \in I\}$ is a partition of $A$ provided that every element of $A$ belongs to exactly
one $A_i$. If $\{A_i : i \in I\}$ is a partition of $A$, then we can define an equivalence relation $\approx$ on $A$
by:

$a \approx b \iff a, b$ belong to the same $A_i$

Exercise A.2.39 Prove that $\approx$ is an equivalence relation.

□

On the other hand, if $\approx$ is an equivalence relation on $A$, then $\approx$ behaves roughly like $=$.
When we lump together all elements that are the same under $\approx$, we get an equivalence class.






Definition A.2.40 Let $\approx$ be an equivalence relation on $A$. For each $a \in A$, define the
equivalence class $E(a)$ of $a$ as follows:

$E(a) = \{b \in A : a \approx b\}$

Note that $E(a) = E(b)$ if and only if $a \approx b$. If $a \not\approx b$, then $E(a) \cap E(b) = \emptyset$. Thus the sets
$E(a)$ are either equal or disjoint. Hence the set $\{E(a) : a \in A\}$ is a partition of $A$.
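The passage from an equivalence relation to a partition can be carried out explicitly on a finite set. The Python sketch below computes the distinct classes $E(a)$; the function name and the sample relation "same remainder on division by 3" are our own choices, and the code simply assumes that the relation passed in really is an equivalence relation.

```python
def equivalence_classes(A, related):
    """Return the distinct classes E(a) = {b in A : related(a, b)} as frozensets.

    Assumes `related` is reflexive, symmetric and transitive on A.
    """
    classes = []
    for a in A:
        E_a = frozenset(b for b in A if related(a, b))
        if E_a not in classes:      # E(a) = E(a') whenever a and a' are related
            classes.append(E_a)
    return classes


A = range(10)
same_remainder = lambda a, b: (a - b) % 3 == 0
classes = equivalence_classes(A, same_remainder)
print([sorted(E) for E in classes])   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

The three classes are mutually disjoint and their union is all of $A$, so they form a partition in the sense of Definition A.2.38.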

A.3 Countable and Uncountable Sets

In this section, we investigate the idea of the size or cardinality of a set. For finite sets, we
can determine the size of a set by counting its elements. Thus for example, the set $\{a, b, c\}$
has cardinality 3 (it has 3 elements). We are going to extend this idea of counting to measure
the size of infinite sets, and we will show that infinity comes in many sizes.

Let's explore the idea of counting: For the moment, let $\mathbf{n} = \{1, 2, \dots, n\}$ be the set of the
first $n$ natural numbers. To say that $A = \{a, b, c\}$ has 3 elements is equivalent to saying that
there is a one-to-one correspondence between the sets $A$ and $\mathbf{3}$. Indeed, this is the heart of
the idea of counting: When we count the elements of $A$, we are setting up a bijection between
$A$ and $\mathbf{3}$. We go "$a$ first, $b$ second, $c$ third". This is equivalent to a map $f : A \to \mathbf{3}$ defined by
$f(a) = 1$, $f(b) = 2$, $f(c) = 3$. Thus the idea of counting the elements of a finite set $X$ involves
finding a bijection between $X$ and some $\mathbf{n}$. If there is a bijection from $X$ to $\mathbf{n}$, then $X$ has $n$
elements.

It is obvious that two finite sets $A$ and $\Lambda$ have the same size if and only if there is a
one-to-one correspondence $f : A \to \Lambda$. We don't even have to count $A$ and $\Lambda$ to know that
they have the same number of elements. If $A = \{a, b, c, d\}$ and $\Lambda = \{\alpha, \beta, \gamma, \delta\}$, then the
existence of the bijection $f : A \to \Lambda$ given by

$f(a) = \beta, \quad f(b) = \delta, \quad f(c) = \alpha, \quad f(d) = \gamma$

is sufficient to show that $A$ and $\Lambda$ have the same number of elements. It doesn't tell us that
this number is 4.

Thus two sets have the same size if and only if there is a bijection between them; we can
bypass the idea of number. This is important, because we cannot actually count infinite sets.
But we can establish bijective correspondences between infinite sets. We shall adopt this idea
as our basic idea of size.



Definition A.3.1 We define an equivalence relation $\sim$ between sets as follows: If $A$, $B$
are sets, we say that $A \sim B$ if and only if there is a bijection from $A$ to $B$. If $A \sim B$,
we say that $A$ and $B$ have the same cardinality. We may also indicate this by saying
$|A| = |B|$.

Note that having the same cardinality is an equivalence relation between sets, i.e. that

(i) $|A| = |A|$ (Reflexivity)

(ii) If $|A| = |B|$, then $|B| = |A|$ (Symmetry)

(iii) If $|A| = |B|$ and $|B| = |C|$, then $|A| = |C|$ (Transitivity)






Exercise A.3.2 Prove this assertion. (Note that the assertion is not obvious: When we say
that $|A| = |B|$, we are not actually claiming that there are two equal numbers. What we are
saying is that there is a bijection from $A$ to $B$. To prove (i), for example, you have to find a
bijection from $A$ to $A$.)

□

Examples A.3.3 (a) Two finite sets have the same cardinality if and only if they have the
same number of elements.

(b) For finite sets, if $A$ is a proper subset of $B$, then $|A| < |B|$. This breaks down completely
for infinite sets. Consider, for example, the sets $\mathbb{N}$ and $\mathbb{Z}$. It is certainly true that $\mathbb{N} \subset \mathbb{Z}$.
However, the map $f : \mathbb{N} \to \mathbb{Z}$ defined by

$f(n) = \begin{cases} \frac{n}{2} & \text{if } n \text{ is even} \\ -\frac{n-1}{2} & \text{if } n \text{ is odd} \end{cases}$

is a bijection: $f(1) = 0, f(2) = 1, f(3) = -1, f(4) = 2, f(5) = -2, f(6) = 3, \dots$ (Note
that we are zig-zagging from the positive integers to the negative integers.) Thus $\mathbb{N}$ and
$\mathbb{Z}$ have the same cardinality, even though $\mathbb{N}$ seems to contain fewer elements than $\mathbb{Z}$.

(c) We also have $|\mathbb{Q}| = |\mathbb{N}|$. This can be seen as follows. Put the set of strictly positive
rational numbers $\mathbb{Q}^+$ in an array

1/1 2/1 3/1 4/1 5/1 ...

1/2 2/2 3/2 4/2 5/2 ...

1/3 2/3 3/3 4/3 5/3 ...

1/4 2/4 3/4 4/4 5/4 ...

1/5 2/5 3/5 4/5 5/5 ...

We can then trace a zig-zag path that moves through all the rational numbers as follows.
Start at the top line and move diagonally down to the left until you reach the leftmost
column. Repeat. We thus obtain a sequence

1/1, 2/1, 1/2, 3/1, 2/2, 1/3, 4/1, 3/2, 2/3, 1/4, 5/1, ...

All of the strictly positive rational numbers occur in this sequence, and they all occur
infinitely many times. For example, 1/1, 2/2, 3/3, ... lie along the diagonal, and they are all
equal. To obtain a bijection from $\mathbb{N}$ to $\mathbb{Q}^+$, we follow the above sequence of rationals, but
we omit any number that has already occurred to ensure that the function is one-to-one,
i.e. we prune away the repeated values. We therefore define the function $f : \mathbb{N} \to \mathbb{Q}^+$ by

$f(1) = 1/1, \quad f(2) = 2/1, \quad f(3) = 1/2, \quad f(4) = 3/1, \quad f(5) = 1/3, \quad f(6) = 4/1, \ \dots$

Note that $f(5) \neq 2/2$, which is after $f(4) = 3/1$ in the sequence, because $2/2 = 1/1$ has already
occurred as $f(1)$. Then $f$ is a bijection from $\mathbb{N}$ to $\mathbb{Q}^+$. Now even though we haven't






found a formula for $f$, it is nevertheless a perfectly good function, and all its values can
be calculated. Can you see that $f(16) = 2/5$?

In the same way, we can set up a bijection $g$ from $\mathbb{N}$ to the negative rationals. Just put
$g(n) = -f(n)$. Finally, we can define a bijection $h : \mathbb{N} \to \mathbb{Q}$ using $f$, $g$ and another
zig-zag: We define

$h(1) = 0, \quad h(2) = f(1), \quad h(3) = g(1), \quad h(4) = f(2),$
$h(5) = g(2), \quad h(6) = f(3), \quad h(7) = g(3), \ \dots$

Again, we have no formula for $h$, but it is certainly a well-defined function, and all its
values can be calculated. Check that $h(23) = -1/5$.

□ 
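Although there is no closed formula for the enumeration of the rationals, its values really can be calculated, as claimed. The Python sketch below follows the zig-zag description literally: it walks the anti-diagonals of the array from the top row down to the left column and prunes repeated values. The generator names are our own, and the formula used in `nat_to_int` is the one read off from the listed values $0, 1, -1, 2, -2, 3, \dots$

```python
from fractions import Fraction
from itertools import count, islice


def nat_to_int(n):
    """Zig-zag bijection N -> Z: 1, 2, 3, 4, 5, ... -> 0, 1, -1, 2, -2, ..."""
    return n // 2 if n % 2 == 0 else -((n - 1) // 2)


def positive_rationals():
    """Yield f(1), f(2), ...: on the diagonal numerator + denominator = d,
    start at d-1 over 1 in the top row and move down to the left column,
    skipping every value that has already occurred."""
    seen = set()
    for d in count(2):
        for denominator in range(1, d):
            q = Fraction(d - denominator, denominator)
            if q not in seen:
                seen.add(q)
                yield q


def all_rationals():
    """Yield h(1), h(2), ...: 0, f(1), -f(1), f(2), -f(2), ..."""
    yield Fraction(0)
    for q in positive_rationals():
        yield q
        yield -q


f_values = list(islice(positive_rationals(), 16))
h_values = list(islice(all_rationals(), 23))
print([nat_to_int(n) for n in range(1, 7)])   # [0, 1, -1, 2, -2, 3]
print([str(q) for q in f_values[:6]])         # ['1', '2', '1/2', '3', '1/3', '4']
print(f_values[15], h_values[22])             # f(16) = 2/5 and h(23) = -1/5
```

The set `seen` grows without bound, so this is a sketch for computing initial values rather than an efficient enumeration.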



Definition A.3.4 A set $A$ is said to be countable if there is a surjection from $\mathbb{N}$ onto $A$.



Remarks A.3.5 (a) Basically a set $A$ is countable if its elements can be indexed by the
natural numbers, i.e. if it can be written as $A = \{a_n : n \in \mathbb{N}\}$. For if $A$ is countable and
not finite, then there is a bijection $f : \mathbb{N} \to A$, and we can take $a_n = f(n)$. Conversely, if
$A = \{a_n : n \in \mathbb{N}\}$ is infinite, we can define a bijection $f$ from $\mathbb{N}$ to $A$ by letting $f(n) =
a_n$ (although here some pruning is necessary if the $a_n$ aren't all distinct; see Example
A.3.3(c)).

(b) A set $A$ is countable if and only if it is either finite or can be put into a one-to-one
correspondence with the natural numbers, i.e. if $|A| = n$ for some $n \in \mathbb{N}$, or $|A| = |\mathbb{N}|$.

(c) In Example A.3.3, we proved that the sets $\mathbb{Z}$ and $\mathbb{Q}$ are countable sets.



(d) The "zig-zag" technique, used above to prove that the rational numbers are countable, 
is often very useful. 

□ 

Exercise A.3.6 Prove that the union of countably many countable sets is countable (i.e.
prove that if $A_n$ ($n \in \mathbb{N}$) are countable sets, then the set $\bigcup_{n \in \mathbb{N}} A_n$ is countable as well.)
[Hint: Zig-zag!]

□ 

So all the infinite sets we've seen so far are countable (and the finite ones also, of course). A 
very natural question that might occur to you is the following: Are all infinite sets countable? 
The answer is "No!" 

Example A.3.7 We show that the unit interval $I = [0, 1]$ is uncountable, i.e. that we cannot
find an enumeration

$I = \{x_n : n \in \mathbb{N}\}$

The proof is by contradiction: Suppose that we can find such an enumeration $I = \{x_1, x_2, x_3, x_4, \dots\}$,
i.e. that every real number in $[0, 1]$ is equal to $x_n$ for some $n$. Now every number $x_n$ has a
decimal expansion of the form

$x_n = 0.x_{n1} x_{n2} x_{n3} x_{n4} x_{n5} \ldots$






where $x_{nm}$ is the $m$-th digit in the decimal expansion of $x_n$. Of course some real numbers
have two distinct decimal expansions, a terminating one and a non-terminating one. For
example, $1.0000\ldots = 0.9999\ldots$.² We will choose the non-terminating decimal expansions for
our $x_n$.

We now create a new real number $x$ from the $x_n$ by a process called diagonalization. We
choose $a_n \in \{1, 2, \dots, 9\}$ such that the following hold:

$a_1 \neq x_{11}, \ a_2 \neq x_{22}, \ a_3 \neq x_{33}, \ \dots, \ a_n \neq x_{nn}, \ \dots$

To avoid a situation where we obtain a number $x$ with a terminating decimal expansion, we
haven't permitted $a_n = 0$; this is just a technicality. We can now define $x$: Put

$x = 0.a_1 a_2 a_3 a_4 \ldots$

Here comes the heart of the argument: Clearly $x \in I = [0, 1]$. Now if $I$ can be written as a
list $\{x_1, x_2, x_3, \dots\}$, then there must be some $n$ such that $x = x_n$. But the first decimal place
of $x$ differs from the first decimal place of $x_1$, since $a_1 \neq x_{11}$; hence $x \neq x_1$. Similarly, the
second decimal place of $x$ differs from the second decimal place of $x_2$, since $a_2 \neq x_{22}$; hence
$x \neq x_2$. We can continue in this way to show that $x \neq x_n$ for any $n \in \mathbb{N}$, i.e. $x$ is not on the
list $\{x_1, x_2, x_3, \dots\}$.

This proves the result! Given any list $x_1, x_2, x_3, \dots$ of real numbers in $[0, 1]$, we now have
a technique for producing a new real number $x$ that is not on the list. It thus follows that no
such list can contain all the real numbers in $[0, 1]$, i.e. there is no bijection from $\mathbb{N}$ to $[0, 1]$.

□ 
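The diagonal construction is completely mechanical, and its first few steps can be carried out on a finite piece of any proposed list. In the Python sketch below each $x_n$ is represented by a string holding the first digits of its decimal expansion after the decimal point. The particular rule for choosing $a_n$ (take 1, unless $x_{nn} = 1$, in which case take 2) is our own assumption; any rule that picks a digit in $\{1, \dots, 9\}$ different from $x_{nn}$ would do.

```python
def diagonal_digits(expansions):
    """Given digit strings x_n1 x_n2 ... for n = 1, ..., k (each of length >= k),
    return digits a_1 ... a_k with a_n != x_nn and a_n in {1, ..., 9}."""
    return "".join("2" if row[n] == "1" else "1"
                   for n, row in enumerate(expansions))


# the first 7 digits of the first 7 numbers on some hypothetical list
listed = ["1415926", "7182818", "4142135", "3333333",
          "9999999", "5000000", "1234567"]

a = diagonal_digits(listed)
print("x = 0." + a + "...")

# x differs from the n-th listed number in the n-th decimal place
print(all(a[n] != listed[n][n] for n in range(len(listed))))   # True
```

However the list is continued, the number $x$ being built already disagrees with each of the first seven entries, which is exactly why it cannot appear anywhere on the list.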

Hence there are uncountable sets. Clearly $\mathbb{R}$ is also uncountable, because otherwise we
could find an enumeration $\{r_1, r_2, r_3, \dots\}$ of $\mathbb{R}$. By omitting any reals which are not in $[0, 1]$,
we could prune this into an enumeration of $[0, 1]$.

Exercise A.3.8 Show that if $A$ is any set, then $|A| \neq |\mathcal{P}(A)|$. Conclude that $\mathcal{P}(\mathbb{N})$ is
uncountable. (Actually, it can be proved that $|\mathbb{R}| = |\mathcal{P}(\mathbb{N})|$.)

[Hint: Suppose that $f : A \to \mathcal{P}(A)$ and consider the set $B := \{a \in A : a \notin f(a)\}$. By
contradiction, show that $B \notin \mathrm{im} f$.]

□ 

Consider that 

1. Q satisfies all the axioms that R does, except for the Completeness Axiom; 

2. Q is countable, but R is uncountable. 

This juxtaposition leads one to suspect that it is the Completeness Axiom which is responsible 
for the uncountability of R. This is indeed the case, as you will see by proving the following 
proposition: 

Proposition A.3.9 Let $I$ be a non-empty bounded interval of real numbers, and let $\{a_n :
n \in \mathbb{N}\} \subseteq I$. Then there is a $p \in I$ such that $p \neq a_n$ for all $n$.

In particular, $I \neq \{a_n : n \in \mathbb{N}\}$ for any sequence $a_n$. Hence $I$ is uncountable.

² An easy way to see this is to note that $1 = 3 \times \frac{1}{3} = 3(0.333\ldots) = 0.999\ldots$.






Exercise A.3.10 We prove Proposition A.3.9.

(a) Explain why there is a closed interval $I_1 \subseteq I$ such that $a_1 \notin I_1$. [Hint: Divide $I$ into three
closed subintervals of equal length.]

(b) Explain why there is a closed subinterval $I_2 \subseteq I_1$ such that $a_2 \notin I_2$. Explain why also
$a_1 \notin I_2$.

(c) Now assume that we have found a closed subinterval $I_n$ such that $a_1, \dots, a_n \notin I_n$. Explain
why there is a closed subinterval $I_{n+1} \subseteq I_n$ such that $a_{n+1} \notin I_{n+1}$. Explain also why we
now have $a_1, a_2, \dots, a_{n+1} \notin I_{n+1}$.

(d) We now have constructed a sequence of closed intervals

$I_1 \supseteq I_2 \supseteq I_3 \supseteq \dots \supseteq I_n \supseteq \dots$

Explain why $\bigcap_{n \in \mathbb{N}} I_n \neq \emptyset$.

[Hint: This uses the Completeness Axiom: Let $l_n$ be the left endpoint of $I_n$. Show that
$\sup\{l_n : n \in \mathbb{N}\}$ exists, and that $\sup\{l_n : n \in \mathbb{N}\} \in \bigcap_{n \in \mathbb{N}} I_n$.]

(e) Let $p \in \bigcap_{n \in \mathbb{N}} I_n$. Explain why $p \in I$, and why $p \neq a_n$ for any $n \in \mathbb{N}$.

□ 
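Steps (a) to (c) of the exercise describe an algorithm, and its first finitely many steps can be run exactly using rational arithmetic. The Python sketch below is only an illustration under our own assumptions: the starting interval is $[0, 1]$, the sample sequence is $a_n = 1/n$, and among the three closed thirds we always take the leftmost one that avoids the current $a_n$. The names `shrink` and `avoid` are hypothetical.

```python
from fractions import Fraction


def shrink(interval, a):
    """Return one of the three closed thirds of `interval` that does not contain a.

    A point lies in at most two (adjacent) thirds, so at least one third avoids it.
    """
    lo, hi = interval
    third = (hi - lo) / 3
    for k in range(3):
        sub = (lo + k * third, lo + (k + 1) * third)
        if not (sub[0] <= a <= sub[1]):
            return sub
    raise AssertionError("unreachable for an interval of positive length")


def avoid(sequence, interval=(Fraction(0), Fraction(1))):
    """Build I_1, I_2, ..., I_n as in steps (a)-(c) and return I_n."""
    for a in sequence:
        interval = shrink(interval, a)
    return interval


a_values = [Fraction(1, n) for n in range(1, 11)]
lo, hi = avoid(a_values)
print(lo, hi, hi - lo == Fraction(1, 3**10))               # the length shrinks by 1/3 each step
print(all(not (lo <= a <= hi) for a in a_values))          # True: a_1, ..., a_10 all lie outside I_10
```

The code can only ever produce finitely many intervals; that the intersection of all of them is non-empty is precisely the point where the Completeness Axiom is needed in step (d).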



